The proposed Paint-CUT is trained and tested on the constructed landscape painting dataset and the ChipPhi dataset, in both comparison experiments and ablation experiments. The IS and FID of the generated results are used to evaluate model performance. The specific experiments are as follows.
  3.3.1. Comparison Experiments
(1) Qualitative analysis
To verify the effectiveness of the proposed Paint-CUT in generating landscape paintings, MUNIT [25], NICE-GAN [26], U-GAT-IT [27], CycleGAN [28], and CUT [22] are selected and their generated results are compared with those of Paint-CUT.
MUNIT is a multimodal unsupervised image-to-image translation model that samples from both a content space and a style space and recombines them to generate the final result. NICE-GAN gives the discriminator a new role by reusing it to encode images of the target domain. U-GAT-IT incorporates a new attention module and a learnable normalization parameter. The attention module guides the model to focus on the important regions that distinguish the source domain from the target domain. In addition, adaptive layer-instance normalization (AdaLIN) is introduced to control shape and texture changes. CycleGAN is a generative adversarial network for image translation with unpaired data, consisting of two generators and two discriminators arranged in a ring. In this paper, we compare these models with the proposed Paint-CUT on the constructed dataset and the ChipPhi dataset, respectively, and analyze their generated results.
First, experiments are conducted on the Dwelling in the Fuchun Mountains samples of the constructed dataset, translating landscape photos into landscape paintings in the style of Dwelling in the Fuchun Mountains. In total, 1000 landscape photos and 1000 landscape paintings in this style are selected from the constructed dataset as training samples, and another 100 photos are selected for testing. The learning rate is set to 0.0001. The comparison results are shown in Figure 7. The Dwelling in the Fuchun Mountains samples feature elegant ink and a balanced layout of mountains and water. As shown in Figure 7b, the results of MUNIT have an ink wash style similar to Dwelling in the Fuchun Mountains. However, the content differs considerably from the input photo: only a general outline is kept and specific texture features are lost. MUNIT samples from a content space and a style space and recombines them to obtain the final result; given the complex structural and style features of landscape paintings, its results are not ideal. The paintings generated by NICE-GAN are shown in Figure 7c. Although content and style improve over MUNIT, some results still have incomplete content and generation errors, such as the large blank area in the mountains in the fourth row of Figure 7c. The results of U-GAT-IT, which incorporates attention and maintains the basic structure and detailed information of the generated paintings, are shown in Figure 7d. However, for some photos with indistinct boundaries, the paintings generated by U-GAT-IT have obvious missing parts and the details are not recovered well. The results of CycleGAN, which uses a cycle consistency loss to keep the translation faithful to the input, are shown in Figure 7e. They are similar in style to Dwelling in the Fuchun Mountains, but some details are still missing. The results of CUT, which replaces the cycle consistency loss with a contrastive loss and recovers the main content of the landscape, are shown in Figure 7f. However, detailed information is still missing and the style is not close enough to Dwelling in the Fuchun Mountains. The results of the proposed Paint-CUT are shown in Figure 7g. In Paint-CUT, SA is used to construct the SA-ResBlock, which enables the model to better capture internal structural features and effectively retain the texture details of the landscape photos. In addition, perceptual loss and edge loss are added to learn the style features and modeling of landscape paintings. Visually, our generated paintings are similar to the original photos in content, for example in the outlines of mountains and the texture of rocks, and they show rich details and a realistic Dwelling in the Fuchun Mountains style.
In conclusion, the proposed Paint-CUT addresses the detail loss, poor style transfer, and blurred outlines seen in the results of existing models, and can generate landscape paintings in the style of Dwelling in the Fuchun Mountains from photos. The comparison in Figure 7 shows that our model generates better results than the other models: the generated paintings maintain the layout and content of the input photos while reflecting the ink wash characteristics of Dwelling in the Fuchun Mountains.
Second, experiments are conducted on the A Thousand Li of Rivers and Mountains samples of the constructed dataset, translating landscape photos into landscape paintings in the style of A Thousand Li of Rivers and Mountains. A total of 500 landscape photos and 500 landscape paintings in this style are selected from the constructed dataset as training samples, and another 100 photos are selected for testing. The learning rate is set to 0.00015. The comparison results are shown in Figure 8. The A Thousand Li of Rivers and Mountains samples are meticulously painted in the blue-and-green style and depict the beauty of southern scenery. The results of MUNIT are shown in Figure 8b. The content of the paintings differs considerably from the landscape photos and the main scenery is not recovered; the color also departs from the blue-and-green style. Since A Thousand Li of Rivers and Mountains emphasizes delicate brush strokes and the blue-and-green style, MUNIT fails to generate satisfactory results. As shown in Figure 8c, the results of NICE-GAN have more pronounced style features than MUNIT, but some elements are missing from the generated paintings. The background color in the first row of Figure 8c is incorrect, and the blue-and-green style is not reproduced well. As shown in Figure 8d, some results of U-GAT-IT lose large portions of the content. For example, the scenery generated in the third row of Figure 8d cannot be distinguished: essential structural information is lost and local details are not generated. As shown in Figure 8e, the results of CycleGAN are similar to the input photos in content, but details and lines are blurred and the color distribution is unreasonable. The results of CUT, shown in Figure 8f, are better than those of the above models, but the details are not clear enough and the style is not apparent. The results of the proposed Paint-CUT are shown in Figure 8g. Paint-CUT can generate landscape paintings in the style of A Thousand Li of Rivers and Mountains from photos, and visually our generated paintings are similar to the original photos in content. As shown in the second row of Figure 8g, the details of the trees in the original photo are maintained in the generated result. The generated paintings maintain the layout and content of the input photos while reflecting the blue-and-green style. In conclusion, the proposed Paint-CUT addresses detail loss, poor style transfer, and blurred outlines in generated landscape paintings, and can generate paintings in the style of A Thousand Li of Rivers and Mountains from photos. The comparison in Figure 8 shows that our model generates better results, which are closer to the original photos in content and to the target paintings in style.
Finally, experiments are conducted on the ChipPhi dataset to demonstrate that the proposed Paint-CUT generalizes well. In total, 1000 landscape photos and 1000 landscape paintings are selected from the ChipPhi dataset as training samples, and another 100 photos are selected for testing. The learning rate is set to 0.0001. The comparison results are shown in Figure 9. The scenes in the ChipPhi landscape paintings are rendered with an ink brush. As shown in Figure 9b, the results of MUNIT are similar to the ink wash style of ChipPhi, but the content of the paintings is deformed and differs from the landscape photos, which suggests that MUNIT cannot generate good results. As shown in Figure 9c, the results of NICE-GAN largely acquire the ink wash style of ChipPhi, but detailed information is not recovered well, and part of the scenery is generated incorrectly due to inconsistent recognition. The results of U-GAT-IT, shown in Figure 9d, are closer to the ink wash style of ChipPhi, but the outlines of the generated paintings are unclear and some areas are generated incorrectly. The results of CycleGAN are shown in Figure 9e. They are consistent with the content of the photos and largely have an ink wash style, but details are still lost (e.g., the mountains in the fourth row). As shown in Figure 9f, the results of CUT are similar to the input photos in content and contain the basic scene information, but problems remain, such as simple texture, blurred outlines (e.g., distant mountains), and inconsistent overall color. Figure 9g shows the results of the proposed Paint-CUT, which can generate landscape paintings in the style of the ChipPhi dataset from photos. Visually, our generated paintings are similar to the original photos in content and maintain their layout. In conclusion, the proposed Paint-CUT addresses detail loss, poor style transfer, and blurred outlines in generated landscape paintings, and can generate paintings in the style of the ChipPhi dataset from photos. The comparison in Figure 9 shows that our model generates better results in both content and style.
In summary, the analysis of the generated results on the constructed dataset and the ChipPhi dataset shows that the proposed Paint-CUT addresses the detail loss, poor style transfer, and blurred outlines seen in the results of existing models. SA is used to construct the SA-ResBlock, which enables the model to better capture internal structural features and effectively retain the texture details of the photos. In addition, perceptual loss and edge loss are added to learn the style features and modeling of landscape paintings. The generated landscape paintings have rich details (e.g., stone texture), clear outlines (e.g., the outlines of distant mountains), and a realistic style. Moreover, Paint-CUT generates better landscape paintings than the other models on both the constructed dataset and the public ChipPhi dataset, which indicates that our model generalizes well.
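The three comparison setups described above can be collected in one place. The following is a minimal sketch that only records the sample counts and learning rates stated in the text; the class name `ComparisonSetup` and its field names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonSetup:
    """One comparison experiment (field names are illustrative assumptions)."""
    style: str             # target painting style or dataset
    train_photos: int      # landscape photos used for training
    train_paintings: int   # landscape paintings used for training
    test_photos: int       # held-out photos used for testing
    learning_rate: float


# Values as stated in the text for the three comparison experiments.
SETUPS = [
    ComparisonSetup("Dwelling in the Fuchun Mountains", 1000, 1000, 100, 0.0001),
    ComparisonSetup("A Thousand Li of Rivers and Mountains", 500, 500, 100, 0.00015),
    ComparisonSetup("ChipPhi", 1000, 1000, 100, 0.0001),
]

for s in SETUPS:
    print(f"{s.style}: {s.train_photos} photos + {s.train_paintings} paintings "
          f"for training, {s.test_photos} photos for testing, lr={s.learning_rate}")
```

Only the A Thousand Li of Rivers and Mountains setup differs from the other two: it uses half as many training samples and a slightly larger learning rate.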
(2) Quantitative analysis
Inception score (IS) [29] and Fréchet Inception Distance (FID) [30] are two evaluation metrics for generative models that measure the quality and diversity of generated results. IS is based on the KL-Divergence between the conditional class distribution of each generated image and the marginal class distribution over all generated images. A higher IS indicates that the generated results have higher quality and diversity. The IS is defined as follows:
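$$\mathrm{IS}(G) = \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right)$$

where $\mathbb{E}$ is expectation, $p_g$ is a distribution encoded by generative model $G$, $x \sim p_g$ indicates that $x$ is an image sampled from $p_g$, $D_{KL}(p \,\|\, q)$ is the KL-Divergence between distributions $p$ and $q$, $p(y \mid x)$ is the conditional class distribution denoting the probability that image $x$ belongs to class $y$, and $p(y)$ denotes the marginal distribution of class $y$.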
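As an illustration of the IS definition above, the following minimal NumPy sketch computes the score from a matrix of class probabilities. It assumes the probabilities $p(y \mid x)$ have already been produced by a pretrained classifier (in common practice an Inception network); the function name `inception_score` and the `eps` smoothing constant are assumptions made for this sketch, not details from the paper.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception score from class probabilities.

    probs: array of shape (N, C); row i is p(y | x_i) for generated image x_i.
    eps:   small constant to avoid log(0) (an assumption of this sketch).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution p(y), estimated by averaging over images.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for every image.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    # Exponential of the expected KL divergence.
    return float(np.exp(kl.mean()))


# Confident and diverse predictions (one distinct class per image) give IS = C.
print(inception_score(np.eye(3)))
# Identical, uninformative predictions give IS = 1.
print(inception_score(np.full((4, 3), 1.0 / 3.0)))
```

The two calls show the extremes of the metric: the score approaches the number of classes when each image is classified confidently and the classes are evenly covered, and it falls to 1 when every image receives the same prediction.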
FID calculates the Fréchet (Wasserstein-2) distance between Gaussian distributions fitted to the features of real images and generated images in a feature space. A lower FID indicates that the results have better quality. The FID is defined as follows:
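$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

where $\mu_r$ is the mean of the real image feature vector, $\mu_g$ is the mean of the generated image feature vector, $\mathrm{Tr}$ is the trace of the matrix, $\Sigma_r$ is the covariance matrix of the real image feature vector, and $\Sigma_g$ is the covariance matrix of the generated image feature vector.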
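The FID formula above can be sketched in a few lines of NumPy. The sketch assumes the feature vectors have already been extracted (in common practice from a pretrained Inception network); the function name `frechet_distance` is hypothetical. Instead of forming the matrix square root explicitly, it uses the fact that the trace of $(\Sigma_r \Sigma_g)^{1/2}$ equals the sum of the square roots of the eigenvalues of $\Sigma_r \Sigma_g$.

```python
import numpy as np


def frechet_distance(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feat_real, feat_gen: arrays of shape (N, D), one feature vector per image.
    """
    feat_real = np.asarray(feat_real, dtype=np.float64)
    feat_gen = np.asarray(feat_gen, dtype=np.float64)

    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feat_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feat_gen, rowvar=False))

    # Tr((Sigma_r Sigma_g)^(1/2)) = sum of sqrt of eigenvalues of Sigma_r Sigma_g.
    # The eigenvalues are real and non-negative in exact arithmetic; clip
    # small negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + np.array([1.0, 2.0, 0.0, 0.0])  # same covariance, shifted mean
print(frechet_distance(real, real))     # approximately 0
print(frechet_distance(real, shifted))  # approximately 1^2 + 2^2 = 5
```

Shifting every feature vector by the same offset leaves the covariance unchanged, so only the squared mean difference contributes to the distance in the second call.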
The evaluation metrics are computed for the landscape paintings generated on the samples of Dwelling in the Fuchun Mountains; the comparison results are shown in Table 2. MUNIT samples from both a content space and a style space and recombines them to generate the final result. Due to the complex structure and style features of landscape paintings, its results are not satisfactory, and MUNIT has the lowest IS and highest FID. The paintings generated by NICE-GAN are similar to the ink wash style, but the details are not recovered well. Despite these problems, the generated paintings improve on MUNIT in both style and content; thus, NICE-GAN has a higher IS and lower FID than MUNIT. U-GAT-IT incorporates attention to maintain the basic structure and detailed information of the generated paintings, and can generate paintings with more details than NICE-GAN. Although some brush strokes are still lost, its paintings are better than those of NICE-GAN in content and style, and its IS is higher and its FID lower than those of NICE-GAN. CycleGAN uses a cycle consistency loss to keep the translation faithful to the input. Its paintings are similar to the landscape photos in content and have an ink wash style, but some brush strokes and details are lost. CycleGAN has a higher IS and lower FID than U-GAT-IT. CUT, which serves as the baseline model, uses a contrastive loss instead of the cycle consistency loss and needs only one generator and one discriminator to perform the image translation. The results of CUT are better than those of the other comparison models in both content and style, with a higher IS and lower FID. However, when CUT is used directly, the details of the generated paintings are still blurred and the style is unclear. The Paint-CUT proposed in this paper uses SA to construct the SA-ResBlock so that the model better captures internal structural features and effectively preserves the texture details of landscape photos. In addition, perceptual loss and edge loss are added to learn the style features and modeling of landscape paintings, so the model can generate landscape paintings of higher quality. Accordingly, the paintings generated by Paint-CUT have the highest IS and lowest FID, which is consistent with the qualitative analysis. The comparison results indicate that the proposed Paint-CUT generates better results, retaining the content of the original landscape photos and recovering the style of the target landscape paintings.
Similarly, the IS and FID of the comparison experiments on the samples of A Thousand Li of Rivers and Mountains and on the ChipPhi dataset are shown in Table 3 and Table 4, respectively. In both cases, the proposed Paint-CUT has the highest IS and lowest FID. In conclusion, the proposed Paint-CUT introduces SA to construct the SA-ResBlock and adds perceptual loss and edge loss to generate landscape paintings. On the samples of Dwelling in the Fuchun Mountains, the samples of A Thousand Li of Rivers and Mountains, and the ChipPhi dataset, our model has the highest IS and lowest FID among the compared models, which is consistent with the qualitative analysis. This indicates that Paint-CUT has a stronger feature learning ability and can generate landscape paintings with a reasonable layout, clear modeling, and a style similar to the target landscape paintings.
  3.3.2. Ablation Experiments
Chinese landscape painting, which mainly depicts natural landscapes of mountains and rivers, emphasizes artistic conception and lines. To evaluate the improvement brought by shuffle attention (SA), perceptual loss, and edge loss, ablation experiments are conducted on the proposed Paint-CUT. SA, perceptual loss, and edge loss are added to the baseline CUT one after another to study the resulting improvements in detail, style, and outlines. The ablation results on the constructed dataset and the ChipPhi dataset are analyzed in detail in this section.
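The cumulative ablation schedule described above (baseline CUT, then SA, then perceptual loss, then edge loss) can be enumerated with a short sketch. The `Variant` class, its flag names, and the variant labels are hypothetical names chosen for this illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One ablation variant (flag names are illustrative assumptions)."""
    name: str
    use_sa: bool          # SA-ResBlock in the generator
    use_perceptual: bool  # perceptual loss L_per
    use_edge: bool        # edge loss L_edge


def ablation_variants():
    """Baseline CUT plus the three components added one after another."""
    components = ["SA", "L_per", "L_edge"]
    variants = [Variant("CUT", False, False, False)]
    for k in range(1, len(components) + 1):
        # The first k components are enabled, the rest stay disabled.
        flags = [i < k for i in range(len(components))]
        variants.append(Variant("CUT + " + " + ".join(components[:k]), *flags))
    return variants


for v in ablation_variants():
    print(v)
```

The last variant, with all three components enabled, corresponds to the full Paint-CUT.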
(1) Qualitative analysis
The ablation results on the Dwelling in the Fuchun Mountains samples of the constructed dataset are shown in Figure 10. The baseline CUT generates landscape paintings with the basic style of Dwelling in the Fuchun Mountains in Figure 10b, but the details of the scenery are unclear. For example, the houses and mountain outlines in the red box are blurred and the ink wash style is not obvious, so the generated result is unsatisfactory. Shuffle attention (SA) can capture the detailed features of the main regions in photos. To address detail loss and focus on the main scenery of the landscape painting, SA is added to the generator to construct the SA-ResBlock. As shown in Figure 10c, the generated painting has more detailed information than the result of the baseline model: the texture of the mountains shows variation in the ink wash and the details of the houses are richer. However, these details are still blurred and do not adapt to the characteristics of the scenery. The perceptual loss ($L_{per}$) can better constrain the content and style information of the generated paintings. To address the unclear style, we further add $L_{per}$, so that the generated paintings are highly consistent with the content of the input photos and the scene details become richer, while the results show variation in the ink wash. As shown in Figure 10d, the ink of the mountains and houses is consistent with the light and dark areas of the photo. In addition, lines can guide the generation of landscape paintings. More outline information, such as the trend of mountains and the waves of water, can be obtained from the lines of the scenery, and rich line information yields paintings with more detailed features. To further improve the quality of the landscape paintings and address blurred outlines and details, we further add the edge loss ($L_{edge}$), which completes the proposed Paint-CUT. Because $L_{edge}$ constrains the line information of landscape paintings, the generated results have clearer outlines and richer details. As shown in the red box in Figure 10e, the details of the mountains are clearer and the doors and windows of the houses are visible. In conclusion, the ablation experiments in Figure 10 show that the proposed Paint-CUT improves the generation quality of landscape paintings over the baseline model. Specifically, SA is used to construct the SA-ResBlock, which enables the model to better capture the internal structural features of the landscape photos and effectively retain the texture details. In addition, perceptual loss and edge loss are added so that the model can learn the style characteristics and modeling of landscape paintings. As a result, the generated landscape paintings retain the content of the photos and have a style similar to Dwelling in the Fuchun Mountains.
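To make concrete how an edge loss can constrain line information, the following sketch shows one common formulation: the mean absolute difference between Sobel edge maps of two grayscale images. This is an assumption made purely for illustration. The edge extractor, the pair of images being compared, and the exact form of $L_{edge}$ used by Paint-CUT are defined in the method section of the paper and may differ from this sketch; the names `edge_map` and `edge_loss` are hypothetical.

```python
import numpy as np

# Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def _filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation on a 2-D grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out


def edge_map(img):
    """Gradient magnitude of a grayscale image (Sobel operator)."""
    img = np.asarray(img, dtype=np.float64)
    gx = _filter3x3(img, SOBEL_X)
    gy = _filter3x3(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def edge_loss(generated, reference):
    """Mean absolute difference between the two edge maps (L1 form)."""
    return float(np.mean(np.abs(edge_map(generated) - edge_map(reference))))


flat = np.ones((8, 8))          # image without any lines
step = np.zeros((8, 8))
step[:, 4:] = 1.0               # image with one vertical outline
print(edge_loss(step, step))    # identical outlines: zero loss
print(edge_loss(flat, step))    # missing outline: positive loss
```

Under this formulation, a generated image that lacks an outline present in the reference is penalized, which matches the qualitative effect reported for the edge loss: clearer outlines and richer line details.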
The ablation results on the A Thousand Li of Rivers and Mountains samples of the constructed dataset are shown in Figure 11. The painting generated by the baseline CUT in Figure 11b largely possesses the blue-and-green style; however, as shown in the red box, its outlines are not clear, many basic brush strokes are lost, and the colors are vague. SA can capture detailed features of landscape photos. As shown in the red box in Figure 11c, after adding SA, the mountain trend becomes clearer and the texture details become richer, although some details are still lost. The blue-and-green style becomes clearer, but the colors of the various scenes are still not accurate enough. The perceptual loss ($L_{per}$) can better constrain the content and style information of the generated paintings. As shown in the red box in Figure 11d, after adding $L_{per}$, the lost brush strokes in the mountains are recovered and the color distribution is more reasonable. However, the outlines and texture details of the mountains are still unclear. The edge loss ($L_{edge}$) can constrain the line information of landscape paintings, giving the generated results clearer outlines and richer details. Adding $L_{edge}$ completes the proposed Paint-CUT. As shown in the red box in Figure 11e, the outlines of the mountains become clear and the texture of the mountains is generated. In conclusion, the ablation experiments in Figure 11 show that the landscape paintings generated by the proposed Paint-CUT retain the content of the photos and have a style similar to A Thousand Li of Rivers and Mountains, which improves the generation quality.
In addition, ablation experiments are conducted on the ChipPhi dataset to demonstrate that SA, perceptual loss, and edge loss in the proposed Paint-CUT are also necessary for generating landscape paintings on a different dataset. The ablation results on the ChipPhi dataset are shown in Figure 12. The painting generated by CUT in Figure 12b is similar in content to the input landscape photo. However, as shown in the red box, parts of the mountains are missing and the overall color of the painting is weak. SA can capture detailed features of the main regions in photos. As shown in the red box in Figure 12c, after adding SA, the missing regions of the mountains are filled in and the texture details are richer, although some details are still missing. The perceptual loss ($L_{per}$) can better constrain the content and style information of the generated paintings. As shown in the red box in Figure 12d, after adding $L_{per}$, the missing part of the mountains is recovered and the generated painting has richer details. However, some textures and lines are still missing. The edge loss ($L_{edge}$) can constrain the line information of landscape paintings, giving the generated results clearer outlines and richer details. Adding $L_{edge}$ completes the proposed Paint-CUT. As shown in the red box in Figure 12e, the outlines of the mountains become clear and the whole painting is highly similar in content to the input landscape photo. In summary, the ablation experiments in Figure 12 show that the landscape paintings generated by the proposed Paint-CUT retain the content of the photos and have a style similar to the target landscape paintings, which improves the generation quality.
In conclusion, the proposed Paint-CUT uses shuffle attention (SA) to construct the SA-ResBlock, which enables the model to better capture internal structural features and effectively retain the texture details of the landscape photos. In addition, perceptual loss and edge loss are added to enable the model to learn the style characteristics and modeling of landscape paintings. As a result, the proposed Paint-CUT can generate landscape paintings with clear outlines, rich details, and a realistic style, which improves the quality of the generated landscape paintings.
To further verify the effects of shuffle attention (SA), perceptual loss, and edge loss on generating landscape paintings, the results of the model variants in Figure 10, Figure 11, and Figure 12 are compared by calculating the IS and FID. The comparison results are shown in Table 5, Table 6, and Table 7, respectively. A higher IS indicates that the generated results have higher quality and diversity. A lower FID indicates that the generated results have better quality.
As shown in Table 5, the baseline CUT has the lowest IS and highest FID on the samples of Dwelling in the Fuchun Mountains, which indicates that CUT alone cannot generate ideal landscape paintings. The qualitative analysis in Figure 10 suggests that the paintings generated by CUT have unclear details and blurred outlines, and that the variation of the ink is not obvious. SA can capture the detailed features of the main regions in photos. To highlight the details of the scenery, SA is added to the generator to construct the SA-ResBlock. The IS increases and the FID decreases, which indicates that adding SA is effective for generating landscape paintings. The perceptual loss ($L_{per}$) can better constrain the content and style information of the generated paintings. To make the generated paintings highly consistent with the content of the input photos and to reproduce the variation of the ink, $L_{per}$ is added on top of SA. After adding $L_{per}$, the IS increases while the FID decreases, which indicates that $L_{per}$ improves the generated landscape paintings. Lines can guide the generation of landscape paintings, and rich line information yields paintings with richer detailed features. Adding the edge loss ($L_{edge}$) completes the proposed Paint-CUT, whose generated paintings have the highest IS and lowest FID, consistent with the qualitative analysis. The experimental results show that the proposed Paint-CUT has a stronger feature learning ability and generates better results, largely retaining the content of the landscape photos while learning the style of the target landscape paintings.
Similarly, the evaluation metrics of the ablation experiments on the samples of A Thousand Li of Rivers and Mountains and on the ChipPhi dataset are shown in Table 6 and Table 7, respectively. After SA is added to construct the SA-ResBlock, the model better captures internal structural features and effectively retains the texture details of the landscape photos. The perceptual loss ($L_{per}$) and edge loss ($L_{edge}$) are then added to complete the proposed Paint-CUT, which can learn the style characteristics and modeling of landscape paintings. The generated results retain the content of the landscape photos and have a style similar to the target landscape paintings, which improves the generation quality. The full model has the highest IS and lowest FID among the ablation variants, which is consistent with the qualitative analysis. In conclusion, Paint-CUT can generate landscape paintings with clear outlines, rich details, and a style similar to the target landscape paintings, which improves the quality of the generated landscape paintings.