Entropy | Article | Open Access | 21 May 2025

Advancing Traditional Dunhuang Regional Pattern Design with Diffusion Adapter Networks and Cross-Entropy

1 Department of Space and Culture Design, Kookmin University, Seoul 02707, Republic of Korea
2 Department of Smart Experience Design, Kookmin University, Seoul 02707, Republic of Korea
3 Department of Global Convergence, Kangwon National University, Chuncheon-si 24341, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Entropy in Machine Learning Applications, 2nd Edition

Abstract

To promote the inheritance of traditional culture, a variety of emerging methods rooted in machine learning and deep learning have been introduced. Dunhuang patterns, an important part of traditional Chinese culture, are difficult to collect in large numbers due to their limited availability. However, existing text-to-image methods are computationally intensive and struggle to capture fine details and complex semantic relationships in text and images. To address these challenges, this paper proposes the Diffusion Adapter Network (DANet). It employs a lightweight adapter module to extract visual structural information, enabling the diffusion model to generate Dunhuang patterns with high accuracy while eliminating the need for expensive fine-tuning of the original model. The attention adapter incorporates a multihead attention module (MHAM) to enhance image modality cues, allowing the model to focus more effectively on key information. A multiscale attention module (MSAM) captures features at different scales, thereby providing more precise generative guidance. In addition, an adaptive control mechanism (ACM) dynamically adjusts the guidance coefficients across feature layers to further enhance generation quality. Furthermore, incorporating a cross-entropy loss function strengthens the model’s semantic understanding and classification of Dunhuang patterns. The DANet achieves state-of-the-art (SOTA) performance on the proposed Diversified Dunhuang Patterns Dataset (DDHP). Specifically, it attains a perceptual similarity score (LPIPS) of 0.498, a graphic matching score (CLIP score) of 0.533, and a feature similarity score (CLIP-I) of 0.772.

1. Introduction

Image generation based on deep learning has been a widely researched topic in the field of computer vision. It focuses on generating various types of images based on the provided modal conditions. For example, text-to-image generation creates semantically consistent images from natural language descriptions. Applying deep learning to the generation of Dunhuang patterns not only supports the preservation and inheritance of traditional culture, but also offers rich inspiration and resources for modern design. From a design studies perspective, this technology enables designers to conduct an unprecedented in-depth exploration of traditional visual languages, thereby deconstructing, recombining, and innovating these historical motifs within contemporary design contexts. Furthermore, the application of artificial intelligence in this domain can foster a new synergy between algorithmic generative capabilities and the designer’s creative intuition, leading to innovative design outcomes that are both culturally resonant and forward-looking. Dunhuang patterns have great application prospects in the fields of virtual reality, computer-aided design, unified pattern generation, and style transfer. Deep learning can analyze Dunhuang patterns and extract their features to build corresponding datasets. This not only improves the accuracy of pattern recognition but also speeds up the construction of digital pattern libraries. In addition, cross-entropy loss is a key component in deep learning network training. It helps the model learn and optimize by minimizing the gap between the predicted probability distribution and the true distribution, thereby improving performance and generalization. In summary, AI can generate patterns of the same style based on the dataset, providing new possibilities in the field of creative design.
Image generation not only creates new visual content from existing images but also generates entirely new images based on different types of data, such as text, sketches, and scene graphs [1]. The development of this technique is important for several application areas, such as data enhancement and pattern recognition [2]. To address the problem that a single input image may correspond to multiple outputs, Jun-Yan Zhu et al. [3] modeled the distribution of possible outputs using a conditional generative model. They mapped the ambiguities into low-dimensional latent vectors and randomly sampled them during testing. This method effectively prevents the pattern collapse problem. It improves the diversity of results by explicitly encouraging a bidirectional mapping between outputs and potential latent codes. To generate photo-realistic images from semantic descriptions, a hierarchically nested adversarial objective was introduced. This objective helps standardize intermediate representations and assists the generator in capturing complex image statistics. This approach improves semantic consistency and image fidelity by using a multi-stage generator architecture. It also employs a multi-purpose adversarial loss, which increases the efficient use of both image and text information [4]. SPADE [5] is used to generate realistic images from semantic layouts. Traditional network architectures often struggle to process semantic masks effectively. To address this, a spatially adaptive normalization layer is proposed. It modulates the activations in the normalization layer through spatially adaptive transformations, allowing semantic information to be conveyed more efficiently. This method demonstrates its superiority on several challenging datasets and supports multimodal and style-guided image synthesis. A new evaluation metric, Fréchet Joint Distance (FJD) [6], has been proposed for evaluating conditioned generative adversarial networks (CGANs). FJD implicitly measures image quality and condition consistency. It does so by calculating the Fréchet distance between the joint distribution of the image and its condition, as well as internal condition diversity and other properties. This metric provides a promising unified metric for CGAN model selection and benchmarking. These research efforts [7,8,9] have promoted the development of image generation techniques. They also offer new perspectives and tools for evaluating and comparing different models. In China, a deep learning-based image restoration technique applies generative adversarial networks (GANs) to restore damaged textile images excavated from Chu tombs [10]. This approach avoids direct contact with the artifacts, reducing the risk of damage, while also providing a new avenue for artifact restoration and research. There are several challenges when applying image generation to Dunhuang patterns. The main ones include accurately recovering complex and irregular textures, ensuring the coherence and consistency of the reconstructed pattern, and accurately predicting missing details when substantial portions of the original are unavailable.
To address the above challenges, this paper proposes the Diffusion Adapter Network. The model provides structural guidance to a large-scale text-to-image model, enabling multimodal Chinese traditional pattern generation. In the field of design science, this study combines deep learning with diffusion model innovation. It opens up a new path for the preservation, innovation, and modern design application of traditional cultural symbols. At the theoretical level, this study introduces advanced AI technologies to enhance design science, challenge traditional design frameworks, and explore new creative directions. Taking Dunhuang pattern generation as an example, we propose the Diffusion Adapter Network (DANet), which effectively integrates multimodal information. This model enables the efficient generation of complex traditional patterns and provides new theoretical perspectives and methodologies for design studies. In practice, this method offers designers a powerful tool to deeply explore the meaning of traditional cultural symbols, accurately extract key features, and reinterpret them in a modern context. It not only helps to support the preservation of endangered patterns, but also stimulates creative inspiration, fosters culturally rich and innovative designs, and promotes diversity in the design industry. This paper has four main contributions:
(1)
We propose a new model, DANet, which integrates textual input with sketch information to efficiently generate Dunhuang patterns. The multihead attention mechanism in the attention adapter processes multiple subspaces in parallel, with each subspace focusing on different features. This allows the model to capture both global structures and fine-grained details simultaneously, enhancing the accuracy and precision of the generated patterns.
(2)
In order to extract multiscale feature information of a Dunhuang image, the MSAM module simultaneously considers features at different resolutions to provide a more comprehensive image understanding. It not only recognizes large contours and shapes but also captures small details and textures.
(3)
The ACM module adaptively adjusts the generation guidance coefficients of different feature layers. This dynamic balancing improves the contribution of different feature layers, improving both the accuracy and efficiency of the generation process.
(4)
For the Dunhuang pattern generation task, cross-entropy loss is introduced as an auxiliary supervision signal to enhance the semantic understanding capability of the attention adapter. This compensates for the limited semantic learning caused by freezing the parameters of the diffusion model. As a result, the generated Dunhuang images exhibit improved semantic consistency and accuracy.

3. Methodology

3.1. Overall Structure of DANet

The overall structure of the DANet is shown in Figure 1. Built on the large diffusion model SD1.5 (Stable Diffusion v1.5), this paper further proposes an attention adapter, which realizes multimodal generation of Dunhuang patterns by fusing text information, sketch structure, and other conditional information. First, we input text information and sketch data (Table 1): the text description is processed by the CLIP text encoder, while the sketch data are processed by the VAE encoder. The processed sketch data are passed through the attention adapter, in which the multihead attention module (MHAM) enhances the cue information of the image modality, the MSAM module extracts image features at different resolutions for a more comprehensive image understanding, and the ACM module adaptively adjusts the conditional guidance weights. The resulting image and text features are then fused and processed through the UNet structure. Finally, the VAE decoder outputs the desired Dunhuang patterns (Figure 2). In this process, the parameters of the CLIP text encoder and the VAE decoder are frozen, indicating the parts that remain unchanged during training, while the parameters of the attention adapter are trainable and are updated to encode the sketch modality.
Figure 1. Overall structure of the DANet.
Table 1. Input sample diagram.
Figure 2. Qualitative-quantitative fusion flowchart.
To address the limited semantic understanding of Dunhuang patterns caused by freezing the parameters of the pre-trained diffusion model, we enhance the attention adapter’s ability to capture domain-specific features in two key ways. Specifically, we introduce a multihead attention module (MHAM) and a multiscale attention module (MSAM) to effectively extract structural and fine-grained information from both sketches and textual cues. MHAM improves semantic focus by attending to multiple subspaces in parallel, enabling the extraction of pattern-critical features. Meanwhile, MSAM ensures the perception of details across different spatial scales. Together, these modules strengthen the adapter’s capacity to perceive and represent the distinctive characteristics of Dunhuang patterns. Furthermore, we introduce a semantic classification supervision mechanism by incorporating a cross-entropy loss during the training of the attention adapter. Given the diverse themes and stylistic variations in the Dunhuang pattern dataset, this supervision encourages the model to perform fine-grained semantic discrimination in the feature space. As a result, the model becomes more capable of distinguishing and understanding the specific semantics associated with each pattern type. This enhances the adapter’s representational precision. Notably, since the underlying diffusion model has been pre-trained on large-scale image–text pairs, it already possesses strong visual-linguistic understanding, which our method effectively leverages. Based on this, the unique semantics of Dunhuang patterns can be effectively captured by training a lightweight adapter on the dataset proposed in this paper. This approach efficiently leverages the visual-linguistic knowledge and structural priors of the pre-trained model, enabling fast and economical adaptation to Dunhuang-specific features. In addition, the generalized visual-semantic knowledge acquired through the pre-training of the base model is retained and remains robust against interference from limited domain-specific data. This ensures that the generated patterns maintain both broad generalization and distinct Dunhuang-specific characteristics. As a result, it strikes a practical balance between computational cost, training efficiency, and generation quality.
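To make the freezing strategy described above concrete, the following minimal PyTorch sketch shows how the pre-trained components can be kept frozen while only the lightweight adapter is optimized. The module names and placeholder layers are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the pre-trained Stable Diffusion components and the
# proposed attention adapter; in practice these would be loaded from checkpoints.
clip_text_encoder = nn.Linear(77, 768)                         # stands in for the CLIP text encoder
vae = nn.Conv2d(3, 4, kernel_size=1)                           # stands in for the VAE encoder/decoder
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)               # stands in for the denoising UNet
attention_adapter = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # trainable adapter (MHAM + MSAM + ACM)

# Freeze the pre-trained components so their visual-linguistic knowledge is retained.
for module in (clip_text_encoder, vae, unet):
    module.requires_grad_(False)
    module.eval()

# Only the lightweight adapter receives gradient updates.
optimizer = torch.optim.AdamW(attention_adapter.parameters(), lr=1e-4)
print(sum(p.numel() for p in attention_adapter.parameters() if p.requires_grad))
```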
From a design perspective, the DANet model reliably generates culturally accurate traditional patterns by fusing sketches and textual inputs. This approach not only respects the designer’s creative input (e.g., initial sketches or conceptual descriptions), but also provides a new tool in the creative design process: designers can efficiently visualize and explore cultural symbols. Especially in traditional pattern design, the accurate expression of the cultural semantics and formal characteristics behind the patterns is crucial. By precisely regulating the attention mechanisms, DANet enables designers to better control cultural fidelity and visual expression in the creative process. This effectively enhances the integration of design creativity and cultural meaning.

3.2. Attention Adapter

We first utilize the multihead attention module (MHAM) to process the input sketch. After extensive experimentation, this method adopts an 8-head attention mechanism, dividing the channels evenly into 8 attention heads. The 8-head attention mechanism enables the model to attend to multiple feature subspaces simultaneously. Moreover, it allows for an even distribution of channel dimensions, preventing over- or under-partitioning that would result in unnecessary computational overhead. Given the query features, the output of multihead attention can be calculated by Equations (1) and (2),
X = \mathrm{Attention}(S, O, V) = \mathrm{softmax}\!\left(\frac{S O^{\top}}{\sqrt{d}}\right) \cdot V   (1)
\mathrm{MultiHead}(S, O, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{P}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(S W_i^{S}, O W_i^{O}, V W_i^{V})   (2)
where S = X W_s, O = X W_o, and V = X W_v are the query, key, and value matrices of the attention operation, respectively; W_s, W_o, and W_v are the weight matrices of the trainable linear projection layers; and W^P is a learnable linear transformation matrix. Based on this design, each head may focus on a different part of the input and can represent more complex functions than a simple weighted average (see Figure 3).
Figure 3. Structure of Attention Adapter.
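As an illustration of Equations (1) and (2), the following PyTorch sketch implements an 8-head attention block of this kind. The channel width (320) and token layout are assumptions for illustration and are not taken from the paper's implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttentionModule(nn.Module):
    """Sketch of the 8-head attention in Eqs. (1)-(2): S, O, V are linear
    projections of the input X, split across h heads and recombined by W^P."""

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        assert dim % heads == 0, "channels must divide evenly across heads"
        self.heads, self.d_head = heads, dim // heads
        self.w_s = nn.Linear(dim, dim)   # query projection W_s
        self.w_o = nn.Linear(dim, dim)   # key projection W_o
        self.w_v = nn.Linear(dim, dim)   # value projection W_v
        self.w_p = nn.Linear(dim, dim)   # output projection W^P

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape

        # Project and reshape to (batch, heads, tokens, d_head).
        def split(t):
            return t.view(b, n, self.heads, self.d_head).transpose(1, 2)

        s, o, v = split(self.w_s(x)), split(self.w_o(x)), split(self.w_v(x))
        # softmax(S O^T / sqrt(d)) V, computed independently per head (Eq. (1)).
        attn = torch.softmax(s @ o.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = attn @ v
        # Concatenate the heads and apply W^P (Eq. (2)).
        out = heads.transpose(1, 2).reshape(b, n, -1)
        return self.w_p(out)

# Example: 4096 sketch tokens (a flattened 64x64 latent) with 320 channels.
x = torch.randn(2, 64 * 64, 320)
print(MultiHeadAttentionModule()(x).shape)  # torch.Size([2, 4096, 320])
```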
Following the multihead attention module (MHAM), the output features are processed by a GSC layer and subsequently passed through three residual downsampling blocks; this sequence facilitates the extraction of multiscale features. The GSC layer consists of group normalization (GN), SiLU activation, and a fusing convolution, which expands the number of feature channels to 320 while keeping the spatial size unchanged. The GN layer splits the features into eight groups and applies mean-variance normalization within each group. This design mitigates the influence of varying batch sizes and ensures stable training even under small-batch conditions. The features then pass through the Swish function, defined in Equation (3). As shown in Figure 4, this activation function provides a smoother gradient flow than the traditional ReLU activation function, which helps to improve the training process and performance of the deep learning model. Finally, a convolutional layer is used that maintains the output size after convolution. This allows the network to capture more detailed features and enhances the model’s generalization and performance.
f(z) = z \cdot \mathrm{sigmoid}(z)   (3)
Figure 4. Swish vs. Sigmoid.
The features are then passed through the downsampling blocks to extract additional multiscale information. By progressively reducing the spatial dimensions, the model captures features at different scales, enhancing its perception and improving generalization. Each downsampling block has a similar structure, consisting of two residual network layers and a downsampling convolutional layer, with a downsampling rate of 2. After each downsampling layer, the feature size is reduced by a factor of 2; the blocks differ only in their input and output channels, and the output features have 640, 1280, and 1280 channels, in that order. Compressing the 64 × 64 feature F_C^1 down to the 8 × 8 feature F_C^4 thus yields the multiscale feature set F_C = \{F_C^1, F_C^2, F_C^3, F_C^4\}. Each F_C^i has the same size as the corresponding intermediate feature F_{enc}^i of the diffusion model and is finally added to it at the matching scale.
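A minimal sketch of the GSC layer and the three residual downsampling blocks described above is given below, assuming a 4-channel VAE latent input and the 320/640/1280/1280 channel plan stated in the text; the exact block internals are an assumption for illustration.

```python
import torch
import torch.nn as nn

class GSC(nn.Module):
    """GroupNorm + SiLU/Swish + 3x3 convolution, keeping the spatial size.
    Eight groups are used as in the text (fewer when the input has fewer channels)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.norm = nn.GroupNorm(min(8, c_in), c_in)
        self.act = nn.SiLU()                          # f(z) = z * sigmoid(z)
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)

    def forward(self, x):
        return self.conv(self.act(self.norm(x)))

class DownBlock(nn.Module):
    """Two residual layers followed by a stride-2 convolution (downsampling rate 2)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.res1 = GSC(c_in, c_in)
        self.res2 = GSC(c_in, c_in)
        self.down = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)

    def forward(self, x):
        x = x + self.res1(x)
        x = x + self.res2(x)
        return self.down(x)

# Assumed channel plan from the text: 320 -> 640 -> 1280 -> 1280, sizes 64 -> 8.
stem = GSC(4, 320)                                    # 4-channel VAE latent input
blocks = nn.ModuleList([DownBlock(320, 640), DownBlock(640, 1280), DownBlock(1280, 1280)])

latent = torch.randn(1, 4, 64, 64)
f = stem(latent)
features = [f]                                        # F_C^1 at 64x64
for blk in blocks:
    f = blk(f)
    features.append(f)                                # F_C^2 .. F_C^4 down to 8x8
print([tuple(t.shape[1:]) for t in features])
```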
In the adaptive control mechanism (ACM), there are four levels of intermediate features in the downsampling process of the UNet encoder, and the features of different layers have different impacts on the generation results. Therefore, this study introduces an adaptive adjustment mechanism for the conditional guidance weights of each feature layer. During training, the network automatically adjusts these weights via backpropagation, enabling the attention adapter to flexibly fuse features. This ensures effective structural control using various inputs such as sketches and text. In summary, multiscale feature extraction and conditional control can be defined as Equations (4) and (5),
F_C = F_{ma}(C)   (4)
\bar{F}_{enc}^{\,i} = F_{enc}^{\,i} + W_i \times F_C^{\,i}, \quad i \in \{1, 2, 3, 4\}   (5)
where C represents the features extracted by the VAE, F_{ma} represents the adapter network, F_C^{\,i} represents the multiple intermediate features extracted, W_i represents the corresponding weights, and F_{enc}^{\,i} is each intermediate-layer feature of the UNet encoder.
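The adaptive weighting of Equation (5) can be sketched as a small module with one learnable coefficient per feature level, as below; treating each W_i as a scalar per level is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveControlMechanism(nn.Module):
    """Sketch of Eq. (5): one learnable guidance weight W_i per feature level,
    updated by backpropagation together with the rest of the adapter."""
    def __init__(self, num_levels: int = 4):
        super().__init__()
        # Initialize all guidance coefficients to 1 (equal contribution at the start).
        self.weights = nn.Parameter(torch.ones(num_levels))

    def forward(self, f_enc: list, f_c: list) -> list:
        # F_bar_enc^i = F_enc^i + W_i * F_C^i for i = 1..4
        return [fe + w * fc for fe, w, fc in zip(f_enc, self.weights, f_c)]

# Example with the multiscale shapes described above.
shapes = [(320, 64, 64), (640, 32, 32), (1280, 16, 16), (1280, 8, 8)]
f_enc = [torch.randn(1, *s) for s in shapes]   # UNet encoder features F_enc^i
f_c = [torch.randn(1, *s) for s in shapes]     # adapter features F_C^i
acm = AdaptiveControlMechanism()
fused = acm(f_enc, f_c)
print([tuple(t.shape[1:]) for t in fused])
```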

3.3. Cross-Entropy Loss

During the optimization process, the SD1.5 parameters are frozen and only the attention adapter is trained, using the same optimization objective as SD1.5. Each training sample is a triplet consisting of the original image Z, the condition map C, and the prompt text y. Given an image Z, it is first embedded into the latent space X by the encoder of the VAE. Then, a time step t is randomly sampled from [0, T] and the corresponding noise is added to X to obtain X_t. The optimization objective in Equation (6) trains the attention adapter to extract useful modal information by predicting the noise residual, where φ denotes the domain-specific encoder, F_C is the modal cue information, h_θ is the noise prediction network, and h is the actual noise. If this objective is weighted, which is achieved by introducing a weighting factor θ, the weighted loss function can be expressed as Equation (7), where θ is a scalar greater than 0 used to adjust the weight of the loss function.
L_{ma} = \mathbb{E}_{X_0,\, t,\, F_C,\, h \sim \mathcal{N}(0, 1)} \left[ \left\| h - h_\theta\!\left(X_t, t, \varphi(y), F_C\right) \right\|_2^2 \right]   (6)
L_{ma}^{\theta} = \theta \cdot \mathbb{E}_{X_0,\, t,\, F_C,\, h \sim \mathcal{N}(0, 1)} \left[ \left\| h - h_\theta\!\left(X_t, t, \varphi(y), F_C\right) \right\|_2^2 \right]   (7)
In diffusion modeling, the temporal embedding is an important condition for sampling. Since the main content of the generated results is determined in the early sampling stage, the features are inserted in the first feature fusion stage, and the timesteps are drawn from the cubic schedule t = \left(1 - \left(\frac{t}{T}\right)^{3}\right) \times T, t \in (0, T), which increases the probability of sampling the early steps. Because the noise intensity of randomly sampled images may be uncontrolled, in this paper the noise is regularized by dividing it by its variance to prevent the noise intensity from exceeding its bounds; the predicted image is then recovered directly from the predicted noise, and the difference between the original image and the predicted image is calculated.
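The following sketch illustrates how the weighted noise-prediction objective of Equation (7), the cubic timestep schedule, and the auxiliary cross-entropy supervision described in this section could be combined in a single training loss; the weighting values, the number of pattern classes, and the classification-head interface are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

T = 1000           # total diffusion steps
theta = 1.0        # loss weighting factor of Eq. (7) (assumed value)
lambda_ce = 0.1    # weight of the auxiliary cross-entropy term (assumed value)

def sample_timesteps(batch: int) -> torch.Tensor:
    """Cubic schedule t = (1 - (t/T)^3) * T, which biases sampling toward the
    early (high-noise) steps that determine the main content of the result."""
    u = torch.rand(batch) * T
    t = (1.0 - (u / T) ** 3) * T
    return t.clamp(min=1).long()

def training_loss(noise_pred, noise, class_logits=None, class_labels=None):
    """Weighted noise-prediction objective of Eq. (7), optionally combined with
    cross-entropy supervision on the adapter's semantic classification output."""
    loss = theta * F.mse_loss(noise_pred, noise)
    if class_logits is not None:
        loss = loss + lambda_ce * F.cross_entropy(class_logits, class_labels)
    return loss

# Example with dummy tensors in the latent shape (4, 64, 64) and 10 assumed pattern classes.
noise = torch.randn(2, 4, 64, 64)
noise_pred = noise + 0.1 * torch.randn_like(noise)     # stands in for h_theta(X_t, t, phi(y), F_C)
logits, labels = torch.randn(2, 10), torch.tensor([3, 7])
print(sample_timesteps(2), training_loss(noise_pred, noise, logits, labels))
```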

4. Experiments

4.1. Dataset and Metrics

In this paper, we build our own Dunhuang pattern image dataset, DDHP, which contains 3000 pattern images divided into a training set of 2800 images and a test set of 200 images, as shown in Figure 5; each image has a semantic mask, a sketch, descriptive texts, and versions with different backgrounds. This method uses the pattern images, sketches, and texts, crops the images to 512 × 512, and randomly pairs each pattern image with one of 10 corresponding descriptions to enhance text diversity. Due to the specificity of the patterns, the experiments use the sketch condition for multimodal pattern generation. Figure 6 and Figure 7 show the effect of mixed cross-generation of textual data and modal information, where the images on the diagonal are the results of generation guided by matching text-modality pairs, and the rest are the results of generation guided by mismatched pairs. The 3000 hand-drawn sketches represent the basic structural features of Dunhuang motifs, and each sketch (Table 2) is accompanied by a hand-written descriptive text covering the elements of the motifs, stylistic imagery, and cultural meanings, such as “Flying maidens encircling a lotus flower center,” “Lianzhu pattern combined with a flame totem structure,” and so on. These texts were compiled by design researchers with reference to the classification of Dunhuang motifs.
Figure 5. Schematic diagram of the DDHP dataset.
Figure 6. Multimodal information cross-generation effects generated by single sketch and text.
Figure 7. Multimodal information cross-generation effects generated by multi-sketch and text.
Table 2. Sample sketches and text presentation.
In this paper, validation is carried out on the DDHP test set, from which 2000 pattern-text pairs are selected for testing; the original images, sketches, and descriptive texts are used, and for each sample, every model is randomly sampled once to obtain the final result. The inter-image similarity is evaluated with the learned perceptual image patch similarity (LPIPS) metric, while the graphic matching and image feature matching of the generated results are evaluated with the graphic matching degree CLIP score (ViT-L/14) and the feature similarity CLIP-I, respectively, to qualitatively assess the generation quality of the attention adapter model. The LPIPS is calculated as shown in Equation (8),
d(z, z_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \left( \hat{y}_{hw}^{\,l} - \hat{y}_{0hw}^{\,l} \right) \right\|_2^2   (8)
where z is the generated image, z_0 is the original image, and \hat{y}_{hw}^{\,l} denotes the activations of layer l of a feature-extraction network after channel-wise normalization. This paper uses the AlexNet network; the normalized activations are multiplied by the layer weights w_l, after which the L2 distance is computed and averaged over spatial positions and layers. A lower score indicates that the quality of the generated image is closer to the original image.
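For reference, a minimal sketch of this evaluation using the publicly available lpips package (AlexNet backbone) is shown below; the package implements Equation (8) with learned layer weights, and the input tensors are assumed to be RGB images scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# AlexNet-backbone LPIPS, as described in the text; inputs are RGB tensors
# scaled to [-1, 1] with shape (N, 3, H, W).
loss_fn = lpips.LPIPS(net='alex')

generated = torch.rand(1, 3, 512, 512) * 2 - 1   # stand-in for a generated pattern
original = torch.rand(1, 3, 512, 512) * 2 - 1    # stand-in for the ground-truth image

with torch.no_grad():
    score = loss_fn(generated, original)
print(float(score))   # lower means the generated image is perceptually closer
```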
Since the LPIPS score does not measure whether the generated image matches the text description and modal information cues, this paper adopts the CLIP score and CLIP-I as the evaluation metrics. The CLIP score uses the CLIP model to extract and measure the cosine similarity between the text features and the image features to evaluate the graphic matching, which is computed as shown in Equation (9),
\mathrm{CLIPScore}(I, C) = \max\left( W \times \cos\left(G_l, G_c\right),\; 0 \right)   (9)
where G c and G l are the features output by the CLIP encoder for text and image processing, W is a constant taking the value of 2.5, and a higher CLIP score represents a better consistency between the generated image and the text.
CLIP-I calculates the feature similarity between the generated image and the cued image to evaluate their similarity and is calculated as shown in Equation (10),
\mathrm{CLIP\text{-}I}(I_1, I_2) = \max\left( \cos\left(I_1, I_2\right),\; 0 \right)   (10)
where I_1 and I_2 represent the features of the generated image and the cue image, respectively, and a higher CLIP-I indicates that the two sets of features are more similar.
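A minimal sketch of both metrics using the Hugging Face transformers CLIP implementation (ViT-L/14) is shown below; the cosine similarity on normalized features follows Equations (9) and (10), while the placeholder images and the helper function are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# ViT-L/14 CLIP backbone, matching the CLIP score (ViT-L/14) used in the text.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_features(image: Image.Image = None, text: str = None) -> torch.Tensor:
    """Return L2-normalized CLIP features for an image or a text prompt."""
    with torch.no_grad():
        if image is not None:
            inputs = processor(images=image, return_tensors="pt")
            feats = model.get_image_features(**inputs)
        else:
            inputs = processor(text=[text], return_tensors="pt", padding=True)
            feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def clip_score(gen_img, prompt, w: float = 2.5) -> float:
    """Eq. (9): max(W * cos(G_l, G_c), 0) between image and text features."""
    cos = (clip_features(image=gen_img) @ clip_features(text=prompt).T).item()
    return max(w * cos, 0.0)

def clip_i(gen_img, ref_img) -> float:
    """Eq. (10): max(cos(I_1, I_2), 0) between two image feature vectors."""
    cos = (clip_features(image=gen_img) @ clip_features(image=ref_img).T).item()
    return max(cos, 0.0)

# Example with placeholder images; in practice these are the generated and cue patterns.
img_a = Image.new("RGB", (512, 512), "white")
img_b = Image.new("RGB", (512, 512), "white")
print(clip_score(img_a, "Flying maidens encircling a lotus flower center"), clip_i(img_a, img_b))
```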

4.2. Ablation Experiments

In this paper, two guided generation methods, sketch and text, are selected for multimodal Dunhuang pattern generation. Using 2000 images from the DDHP test set for evaluation, each method generates a single random sample per image as the final result. To assess the model's ability to generate traditional Chinese pattern graphics, the analysis of the experimental results begins with the ablation experiments below.
To verify the effectiveness of each module, this paper adds each module of the attention adapter one by one on the basis of the SD1V5 model and constructs ablation experiments on the DDHP test set. For a fair comparison, the ablation experiments use the same training configuration and number of iteration steps, and the results are shown in Table 3.
Table 3. Results of ablation experiments on the DDHP dataset.
Compared with the base model, adding the MHAM module yields about an 8% improvement in the CLIP-I score, indicating that the multihead attention mechanism helps improve the model's ability to extract and fuse key features. In addition, the base model with the MSAM module gains about a 7.3% improvement in the CLIP-I score, indicating that multiscale feature injection not only improves the structural awareness of the generation but also contributes to graphical consistency. When the ACM module is introduced on this basis, both the CLIP score and CLIP-I indicators show some improvement, indicating that the ACM module can optimize the feature adjustment weights to effectively improve the results. When both the MHAM and MSAM modules are added to the base model, the scores improve again, indicating that the two modules are complementary and jointly promote performance. Finally, with all three modules, that is, the complete attention adapter model, the best performance is achieved on both the CLIP score and CLIP-I metrics, which indicates that, together with the ACM module, the other two modules can adaptively adjust the weights of the different feature injection layers to obtain better results.
In addition, to evaluate the effectiveness of semantic supervision using cross-entropy loss, we compare it with a variant of the DANet trained using L1 loss. The results show that the DANet (L1 loss) achieves a CLIP score of 0.521 and a CLIP-I of 0.745. As a pixel-level supervision signal, L1 loss emphasizes low-level detail reconstruction rather than high-level semantic categorization. While this may improve the realism of generated images at the pixel level, it often leads to weaker semantic consistency in the generated patterns. The experimental results show that DANet supervised with cross-entropy loss achieves higher CLIP score and CLIP-I values compared to L1 loss. This indicates better performance in terms of semantic and categorical accuracy. Cross-entropy loss provides more effective supervision for the attention adapter, enhancing its ability to capture Dunhuang-specific semantics and significantly improving the consistency and accuracy of the generated patterns.

4.3. Comparative Experiments

For a fair comparison, all the compared models are fine-tuned on the DDHP training set. In the case of collaborative diffusion, the same training data are used as in this paper, but without additional fine-tuning. Under the same conditions, each method randomly generates 1 image per input, and the evaluation metrics are computed as the average of over 2000 generated images. We compare our proposed method with several representative mainstream approaches, including collaborative diffusion, the original SD1.5, ControlNet, as well as GAN- and CGAN-based methods that rely on the efficient fine-tuning of large models. In addition, to evaluate the semantic modeling and generation capability of our proposed adapter for Dunhuang images under a frozen large-scale diffusion model, we conducted a comparison with the diffusion model fine-tuned using LoRA on SD1.5. LoRA achieves lightweight fine-tuning by introducing low-rank matrices to a subset of parameters, which is similar in spirit to the training strategy used by DANet.
In the DDHP dataset, each input (a textual description and the corresponding sketch) is accompanied by a real target pattern (i.e., the ground truth). These images are used in both the training and evaluation of the GAN-type models. For the diffusion-based models, the same evaluation protocol is followed as closely as possible. The performance comparison between the DANet and current mainstream methods is shown in Table 4. The DANet achieves the best LPIPS score: for Dunhuang patterns generated with sketch guidance, LPIPS is reduced from 0.532 to 0.498, a performance improvement of about 6.4% over the GAN and about 31.7% over the CGAN. This fully indicates that the pattern generated under the guidance of the DANet model is the most similar to the original image. The model performs better in guiding image generation when structural information is available, and it can more accurately guide large models to generate images that satisfy multimodal constraints.
Table 4. Comparative experimental results on the DDHP dataset.
In this paper, we use the CLIP model to compute the feature consistency between the generated image and the modal information, and the CLIP score results show that the DANet performs the best while the GAN scores the lowest. This indicates that, compared with a conventional diffusion model dedicated to images, a general-purpose large diffusion model trained on massive data has strong generative ability and significantly outperforms the GAN model in the graphic matching score. Compared to the CGAN, the DANet score grows from 0.512 to 0.533, a performance improvement of about 4.1%; the improvement over the GAN is about 33.9%, and the DANet also improves on the other mainstream methods, showing the best performance. This indicates that the DANet achieves better graphical consistency between the prompt text and the generated image in pattern image generation and can better balance the textual semantic description and the structural modal information to guide the generation of the corresponding pattern features. In terms of the CLIP-I score, the DANet achieves the highest value, superior to the other mainstream methods. For sketch-guided generation, the CLIP-I score improves from 0.649 (GAN) to 0.772, a performance improvement of about 18.9%, and about 14.9% compared with the CGAN. This reflects that the pattern image generated under DANet guidance is not only similar in pixel space to the original pattern image that provides the structural modal information but also more similar in the deep features extracted by CLIP, and that the DANet can more effectively utilize the deep features of the structural modal information to guide image generation. In conclusion, the DANet model proposed in this paper can fully utilize the conditional information and achieves the best results on all the tested metrics.
In addition, Table 4 also reports image generation efficiency; all experiments were run on a single NVIDIA A100 GPU (40 GB). The GAN-based methods have very high inference speeds because the generation process relies on only one forward pass without multi-step iterative sampling. Their FPS ranges from 1.8 to 12.5, and the inference speed gradually decreases as the network structure becomes more complex: the traditional GAN reaches an FPS of 12.5, while BigGAN and the CGAN, with more parameters, reach 2.2 and 1.8, respectively. Compared to GANs, the diffusion-based approaches have slower inference, since images must be generated through a gradual denoising sampling process. To generate higher-quality Dunhuang images, the experiments use the DDIM sampler with 50 sampling steps. The diffusion model reaches an FPS of 0.52, and the LoRA-Baseline based on diffusion fine-tuning reaches 0.50. The DANet's inference speed, 0.42 FPS, is lower than both due to the additional attention adapter structure. LoRA merely merges additional small matrices and thus hardly increases the inference cost of the diffusion model; it improves the CLIP score and CLIP-I of SD while keeping the inference speed essentially unchanged. The attention adapter, by contrast, is an additional module tailored to the Dunhuang pattern generation task that uses multihead attention and multiscale features to extract key information introduced from outside, rather than only applying low-rank updates to the original model parameters, and the cross-entropy loss is introduced specifically to classify the semantics. As a result, its inference speed is lower than that of the LoRA diffusion model. However, for Dunhuang patterns, the DANet is more expressive in terms of semantic guidance and stylistic consistency, which shows that the attention adapter improves performance while keeping the computational overhead low.
Table 5 shows the visualization results of the DANet, PGGAN, and LoRA-Baseline under the joint guidance of sketches and text. The first column is the manually drawn sketch, and the second column is the text description used for semantic guidance. It can be seen that the DANet better preserves the structural details in the sketches, such as the petal shape of the lotus flower and the hand connections of the flying dancer. At the same time, it creatively handles the layered color changes of the lotus flower, making it look more vivid. The LoRA-Baseline shows a weaker understanding of the semantics, failing to capture the Dunhuang context implied by the sketch. The PGGAN, in turn, renders the dancing figure, a motif originally symbolic of Dunhuang, in a more cartoonish drawing style. This suggests that the DANet performs better in stylistic consistency and spatial semantic alignment and is able to more accurately realize the Dunhuang-style compositions described in the text. The attention adapter proposed in the DANet is superior in the integration and expression of Dunhuang pattern styles.
Table 5. Visualization results for different models.

5. Conclusions

In this paper, the DANet and a cross-entropy loss are proposed to give a large text-to-image diffusion model the ability to generate pattern images from multimodal cues. The core design of the attention adapter proposed in this paper is based on a multihead attention strategy and a multiscale attention module. These modules extract key visual modal cues and multiscale features for multimodal attention, while the diffusion-based large model receives structural guidance to enable the accurate generation of Dunhuang patterns. Furthermore, the DANet introduces the cross-entropy loss function: the Softmax function first converts the output of the model into a probability distribution, and the cross-entropy loss then measures the difference between this predicted probability distribution and the true probability distribution. This combination not only makes the gradient easy to calculate but also provides good numerical stability during optimization. The results of the ablation and comparison experiments demonstrate that the proposed attention adapter outperforms mainstream models in both the semantic guidance and structural consistency of image generation. In addition, it consumes fewer training resources than the other compared models. The attention adapter proposed in this study demonstrates strong generality and generalization ability. It can guide exemplar generation by combining multiple attention modules, and it is not only compatible with custom models of the same architecture but also integrates seamlessly with existing style-controlled generation tools. This makes the attention adapter suitable not only for the image generation of Dunhuang patterns but also for guiding the generation of general-purpose objects, which broadens its scope of application. In terms of design theory, this study for the first time integrates qualitative design methods with quantitative deep learning techniques, proposing a new way to empower design creativity with technology.
Despite the above contributions, the model proposed in this paper currently targets Dunhuang motifs and has not yet been validated on other traditional arts. In addition, the use of English text cues may limit the direct application of the model in Chinese scenarios, requiring users to have a certain level of English proficiency. Because it builds on a diffusion model, the DANet's inference is slower; although this remains within acceptable limits, it still needs improvement for practical applications.
From a design methodological perspective, this study is the first to integrate qualitative design approaches with quantitative deep learning techniques, presenting a novel framework for enhancing creative design through technological means. It also provides a new vision and a practical example for future research on the cross-fertilization of design theory and intelligent technology. At present, however, this method mainly uses English text prompts, and the generated images are mainly Dunhuang patterns. The model still needs to be adapted to the demands of the Chinese environment and of Dunhuang pattern generation, and there may be some limitations in applying it to related fields in China. In the future, our goal is to explore how to apply this technology more effectively in domestic public security-related fields, to integrate more dimensions of modal information to guide image generation, to replace the English textual descriptions with Chinese ones, and to optimize the base generation model by using and fine-tuning a Dunhuang-pattern base model so as to achieve more practical applications.

Author Contributions

Conceptualization, Y.T.; methodology, T.Y.; software, Z.C. and T.Y.; validation, Y.T. and S.L.; formal analysis, T.Y. and Z.C.; investigation, Y.T., T.Y., and Z.C.; resources, T.Y.; writing—original draft preparation, Y.T.; writing—review and editing, Y.T. and S.L.; visualization, Y.T. and S.L.; supervision, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data generated or analyzed during this study are included in this article. The raw data are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank all of those who supported us in this work. Thanks to the reviewers for their comments and efforts to help improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z. (HTBNet) Arbitrary Shape Scene Text Detection with Binarization of Hyperbolic Tangent and Cross-Entropy. Entropy 2024, 26, 560. [Google Scholar] [CrossRef] [PubMed]
  2. Chen, Z.; Yi, Y.; Gan, C.; Tang, Z.; Kong, D. Scene Chinese Recognition with Local and Global Attention. Pattern Recognit. 2025, 158, 111013. [Google Scholar] [CrossRef]
  3. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  4. Zhang, Z.; Xie, Y.; Yang, L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  5. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  6. DeVries, T.; Romero, A.; Pineda, L.; Taylor, G.W.; Drozdzal, M. On the evaluation of conditional GANs. arXiv 2019, arXiv:1907.08175. [Google Scholar]
  7. Shen, Y.; Liang, J.; Lin, M.C. GAN-based garment generation using sewing pattern images. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings; Part XVIII 16; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  8. Sha, S.; Wei, W.T.; Li, Q.; Li, B.; Tao, H.; Jiang, X.W. Textile image restoration of Chu tombs based on deep learning. J. Silk 2023, 60, 1–7. [Google Scholar]
  9. Chen, Z. Graph Adaptive Attention Network with Cross-Entropy. Entropy 2024, 26, 576. [Google Scholar] [CrossRef] [PubMed]
  10. Orteu, J.-J.; Garcia, D.; Robert, L.; Bugarin, F. A speckle texture image generator. In Proceedings of the Speckle06: Speckles, from Grains to Flowers, Nimes, France, 13–15 September 2006; Volume 6341. [Google Scholar]
  11. Adibah, N.; Noor, N.M.; Suaib, N.M. Facial Expression Transfer using Generative Adversarial Network: A Review. IOP Conf. Ser. Mater. Sci. Eng. 2020, 864, 012077. [Google Scholar] [CrossRef]
  12. Yan, B.; Zhang, L.; Zhang, J.; Xu, Z. Image Generation Method for Adversarial Network Based on Residual Structure. Laser Optoelectron. Prog. 2020, 57, 181504. [Google Scholar] [CrossRef]
  13. Denton, E.L.; Chintala, S.; Fergus, R. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 1486–1494. [Google Scholar]
  14. Belén, V.-M.; Rubio-Escudero, C.; Nepomuceno-Chamorro, I. Generation of synthetic data with conditional generative adversarial networks. Log. J. IGPL 2022, 30, 252–262. [Google Scholar]
  15. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  16. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  17. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv 2019, arXiv:1812.04948. [Google Scholar]
  18. Brock, A. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  19. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021. [Google Scholar]
  20. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  21. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  22. Chen, Z. Arbitrary Shape Text Detection with Discrete Cosine Transform and CLIP for Urban Scene Perception in ITS. IEEE Trans. Intell. Transp. Syst. 2025. early access. [Google Scholar] [CrossRef]
  23. Avrahami, O.; Lischinski, D.; Fried, O. Blended diffusion for text-driven editing of natural images. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  24. Voss, A.; Voss, J. Fast-dm: A free program for efficient diffusion model analysis. Behav. Res. Methods 2007, 39, 767–775. [Google Scholar] [CrossRef] [PubMed]
  25. Wagenmakers, E.J.; Van Der Maas, H.L.; Grasman, R.P. An EZ-diffusion model for response time and accuracy. Psychon. Bull. Rev. 2007, 14, 3–22. [Google Scholar] [CrossRef] [PubMed]
  26. Li, X.; Liu, Y.; Lian, L.; Yang, H.; Dong, Z.; Kang, D.; Zhang, S.; Keutzer, K. Q-diffusion: Quantizing diffusion models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  27. Borji, A. Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and dall-e 2. arXiv 2022, arXiv:2210.00586. [Google Scholar]
  28. Eyadah, H.; Tawfiqe, A.; Odaibat, A.A. A Forward-Looking Vision to Employ Artificial Intelligence to Preserve Cultural Heritage. Humanities 2024, 12, 109–114. [Google Scholar] [CrossRef]
  29. Gaber, J.A.; Youssef, S.M.; Fathalla, K.M. The role of artificial intelligence and machine learning in preserving cultural heritage and art works via virtual restoration. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 10, 185–190. [Google Scholar] [CrossRef]
  30. Dandan, Z.; Lin, Z. Research on innovative applications of AI technology in the field of cultural heritage conservation. Acad. J. Humanit. Soc. Sci. 2024, 7, 111–120. [Google Scholar]
  31. Sun, D. Application of traditional culture in intelligent advertising design system in the internet era. Sci. Program. 2022, 2022, 7596991. [Google Scholar] [CrossRef]
  32. Winiarti, S.W.; Sunardi, S.; Ahdiani, U.; Pranolo, A. Tradition Meets Modernity: Learning Traditional Building using Artificial Intelligence. Asian J. Univ. Educ. 2022, 18, 375–385. [Google Scholar]
  33. Wu, H. Innovation of Traditional Culture Development Model in Advertising Design Based on Artificial Intelligence. Available online: https://www.clausiuspress.com/article/3440.html (accessed on 14 May 2025).
  34. Hu, Y. Research on the Design Method of Traditional Decorative Patterns of Ethnic Minorities under the Trend of AIGC. J. Electron. Inf. Sci. 2023, 8, 58–62. [Google Scholar]
  35. Hang, W.; Alli, H.; Hawari, N.; Wang, W. Artificial Intelligence in Packaging Design: Integrating Traditional Chinese Cultural Elements for Cultural Preservation and Innovation. Int. J. Acad. Res. Bus. Soc. Sci. 2024, 1826–1836. [Google Scholar] [CrossRef]
  36. Wu, S. Application of Chinese traditional elements in furniture design based on wireless communication and artificial intelligence decision. Wirel. Commun. Mob. Comput. 2022, 2022, 7113621. [Google Scholar] [CrossRef]
