1. Introduction
In recent years, with the rapid advances in text-to-image generation and diffusion models, text-driven style transfer has gradually become an important research topic in image generation. These methods aim to preserve the semantic content described by text prompts while incorporating the visual style of reference images into the generated results, allowing generated images to achieve both semantic consistency and stylistic expressiveness. However, existing methods still face several important challenges during the generation process.
First, in text-driven style transfer tasks, achieving a proper balance between text prompts and style features remains challenging. If the model relies excessively on the conditional information provided by the style image, the generated results may exhibit stronger stylistic characteristics but may also suffer from style overfitting. As a result, the generated images may deviate from the semantic content described by the original text prompt, thereby reducing prompt alignment. In contrast, if the model places too much emphasis on the semantic constraints of the text prompt, the generated results may better preserve the objects, scenes, or structures specified in the prompt. However, the influence of style features during generation may be weakened, leading to weaker style representation and reducing the stylized appearance of the generated images.
In addition, content leakage is another important challenge in text-driven style transfer. Ideally, the style image should only provide style-related information, such as colors, brushstrokes, and textures, rather than directly introducing semantic content into the generated results. However, during the generation process, semantic content from the style image, such as object shapes, scene elements, or other semantic structures, is often injected into the model together with style features. Consequently, the generated images may unintentionally contain content from the reference image itself. This phenomenon not only reduces prompt alignment but may also negatively affect the overall generation quality.
To tackle these issues, existing methods predominantly rely on traditional style-conditioning techniques, such as AdaIN-based feature modulation or static cross-attention mechanisms. However, these approaches typically perform a rigid, one-way transformation of the feature distribution, for example shifting content features toward the style distribution, and pass the results directly to downstream layers. Because the transformation runs in one direction only, they cannot establish mutual interaction between text prompts and style features, which makes it difficult to adaptively balance style expression against semantic constraints or to suppress content leakage effectively.
Based on the above challenges and the limitations of existing conditioning methods, this study proposes more effective feature fusion and modulation mechanisms, namely Entropy-Aware Adaptive Fusion (EAAF) and Progressive Feature Reweighting (PFR). Unlike traditional one-way methods, our EAAF introduces a bidirectional feature transformation process that enables mutual interaction between text prompts and style features. It further evaluates their interaction via entropy to determine adaptive fusion weights. Meanwhile, PFR is deployed as a progressive strategy to further eliminate artifacts and refine visual details. As a result, the proposed method improves prompt alignment, style fidelity, and overall image quality. The main objectives of this study are as follows:
To introduce a bidirectional adaptive mechanism (EAAF) that mitigates the imbalance between text prompts and style features in text-driven style transfer.
To reduce content leakage caused by unnecessary semantic content from style images.
To provide a progressive optimization strategy (PFR) that enhances prompt alignment, style fidelity, and the overall visual quality of generated results.
To verify the effectiveness, modularity, and generalizability of the proposed plug-and-play framework through experimental analysis under different style and text prompt conditions.
2. Related Work
2.1. Style Transfer
2.1.1. Image-Driven Methods
Image-driven style transfer methods use content and style images as inputs to generate stylized results. Gatys et al. [1] first introduced neural style transfer using deep convolutional features to represent content and Gram matrices to represent style. Johnson et al. [2] later proposed a feed-forward framework based on perceptual loss for real-time style transfer, while Ulyanov et al. [3] showed that Instance Normalization is more suitable than Batch Normalization for style transfer tasks. Huang and Belongie proposed AdaIN [4], which aligns the statistical distributions of content and style features for arbitrary style transfer. Li et al. [5] introduced Whitening and Coloring Transform (WCT) to better align feature distributions, and Li et al. [6] further improved efficiency through learnable linear transformations.
Subsequent studies focused on improving feature representation and style control. StyleBank [7] used explicit filter banks to independently represent different styles, while Sanakoyeu et al. [8] proposed style-aware content loss to better balance content preservation and style consistency. More recently, StyTr² [9] introduced Transformer architectures into style transfer to improve global feature modeling. Although image-driven methods provide explicit style control, they still depend heavily on reference images and may struggle with abstract or flexible style descriptions.
2.1.2. Text-Driven Methods
Text-driven methods use natural language as the control signal for image generation and stylization. The development of these methods is closely related to CLIP [10], which maps text and images into a shared feature space. CLIPstyler [11] demonstrated that text prompts can directly guide style generation through CLIP-based image-text alignment. StyleCLIP [12] combined CLIP with StyleGAN latent manipulation for semantic image editing, while Name Your Style [13] further explored text-driven artistic style control.
Recent studies increasingly combine text prompts with diffusion models and reference style images. DiffStyler [14] introduced a dual diffusion framework for integrating text guidance with style transfer. InstantStyle [15] focused on preserving reference style features, while CSGO [16] introduced separate feature injection for disentangled control of content and style conditions. StyleStudio [17] addressed feature fusion, prompt alignment, and style overfitting issues.
Other studies focused on controllability and personalization. Textual Inversion [18] learned dedicated text embeddings for new concepts or styles, DreamBooth [19] fine-tuned diffusion models using a small number of reference images, and StyleCrafter [20] extended style-driven generation to text-to-video tasks through adaptive style fusion mechanisms.
2.1.3. Evolution of Model Architectures
From the perspective of model architectures, style transfer methods have evolved from CNN-based and GAN-based approaches to Transformer-based and diffusion-based frameworks. Early CNN-based methods, such as Gatys et al. [1], AdaIN [4], and StyleBank [7], mainly relied on convolutional features for style representation and feature fusion. GAN-based methods later improved realism and diversity through adversarial learning. CycleGAN [21] achieved image-to-image translation without paired data, while MUNIT [22] separated content and style representations for multimodal image translation. StyleGAN [23] further introduced style-based generators for controllable image synthesis.
Transformer-based approaches improved global feature modeling and long-range dependency learning. StyTr² [9] applied Transformers to style transfer, while Taming Transformers [24] combined VQGAN and Transformers for high-resolution image generation. MASTER [25] further improved zero-shot and few-shot artistic style transfer through controllable Transformer-based stylization.
Recently, diffusion models have become the dominant framework for text-to-image generation and style transfer due to their strong generation quality and controllability. DDPM [26] established the foundation of diffusion models, while Dhariwal et al. [27] demonstrated their superior image synthesis quality compared with previous generative approaches. Latent Diffusion Models (LDMs) [28] further improved efficiency by performing diffusion in latent space, enabling high-resolution image generation with reduced computational cost. In addition, ControlNet [29] enhanced controllability by introducing additional conditional branches for incorporating external guidance. Building upon these advances, recent studies have increasingly applied diffusion models to style transfer through style conditioning, feature injection, and attention-based fusion mechanisms, forming the foundation of modern diffusion-based style transfer methods.
2.2. Diffusion-Based Style Transfer Methods
Recent diffusion-based style transfer methods mainly focus on three directions: conditional injection and style representation, prompt-style balancing, and content leakage suppression.
Regarding conditional injection and style representation, IP-Adapter [30] used decoupled cross-attention to integrate image conditions into pretrained diffusion models. StyleAdapter [31] introduced dual-path cross-attention and semantic suppression mechanisms to improve text controllability and style consistency. StyleShot [32] emphasized generalized style representations through style-aware encoders and dedicated style datasets. StyleTokenizer [33] aligned style representations with text embedding space, while ArtAdapter [34] improved style fidelity through multi-level style encoders and adaptation mechanisms.
For prompt-style balancing, InstantStyle [15] selectively injected style-related features into diffusion models to reduce semantic interference. CSGO [16] independently controlled text and style conditions through separate feature injection. DiffStyler [14] balanced content and style information through dual diffusion mechanisms, while StyleStudio [17] focused on prompt alignment and feature fusion control. DEADiff [35] further disentangled style and semantic representations to improve prompt alignment and controllability.
Other studies focused on controllable generation and feature manipulation. Prompt-to-Prompt [36] utilized cross-attention maps for controllable text editing, while Plug-and-Play Diffusion Features [37] enabled image editing and feature manipulation without retraining. In addition, StyleCrafter [20] extended diffusion-based style control from image generation to video generation tasks.
Overall, diffusion-based style transfer has become a major research direction in recent years. The focus of existing studies has gradually evolved from simple style injection toward style representation learning, content-style balancing, content leakage suppression, and prompt alignment. This trend reflects a shift from merely incorporating style information to achieving more precise coordination between textual semantics and style representations. However, effectively balancing these factors remains challenging, particularly under diverse prompt-style combinations, where content leakage and style degradation may still occur during the denoising process.
2.3. Adaptive Instance Normalization (AdaIN)
Adaptive Instance Normalization (AdaIN) [4] remains one of the most influential methods in style transfer research. AdaIN aligns the mean and variance of content and style features, allowing the output features to preserve content structure while adopting the target style distribution. Beyond efficient feature fusion, AdaIN introduced the concept of feature distribution alignment, which later became the foundation of many style transfer methods.
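For reference, the AdaIN operation of Huang and Belongie [4] can be written as below, where $x$ is a content feature map, $y$ is a style feature map, and $\mu(\cdot)$ and $\sigma(\cdot)$ are the channel-wise mean and standard deviation computed over spatial positions. The symbol names follow the original AdaIN formulation and are not necessarily the notation used elsewhere in this study.

```latex
\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)
```

The normalized content feature keeps its spatial structure, while its first- and second-order channel statistics are replaced by those of the style feature.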
Recent text-driven style transfer studies further extended AdaIN from image-to-image alignment to cross-modal feature fusion between text and style representations. For example, StyleStudio [17] incorporated AdaIN-based alignment into multimodal feature fusion for balancing textual semantics and style conditions. AdaIN also inspired later feature transformation methods, including WCT [5] and learnable feature transformation approaches [6].
However, because AdaIN mainly relies on first-order and second-order statistics, it may be insufficient for modeling more complex semantic relationships between content and style features. Consequently, recent studies increasingly introduce attention mechanisms, feature disentanglement, and stage-wise feature modulation strategies to address style overfitting, style degradation, and content leakage problems.
3. Methodology
3.1. Overall Architecture
This study is built upon existing diffusion-based text-driven style transfer frameworks and introduces two plug-and-play modules, namely Entropy-Aware Adaptive Fusion (EAAF) and Progressive Feature Reweighting (PFR), to improve prompt alignment and style preservation while reducing content leakage. Rather than modifying the original diffusion architecture, the proposed modules are designed to be integrated into existing stylization frameworks as lightweight enhancements. As illustrated in Figure 1, the model takes a text prompt and a reference style image as inputs. The text prompt provides semantic guidance for image generation, while the style image supplies visual style information such as colors, textures, and brushstroke patterns.
In the generation pipeline, the reference style image is first processed by an image encoder to extract style features. The extracted style representations are then projected into a feature space compatible with textual features through a style projection layer. Meanwhile, the text prompt is encoded into text embeddings through a text encoder. Both textual and style features are subsequently injected into the diffusion model as conditional information and continuously participate in the denoising process of the U-Net backbone to progressively guide image generation.
To improve the fusion of textual semantics and style representations, the proposed framework introduces the Entropy-Aware Adaptive Fusion (EAAF) module into the conditional feature fusion process of the diffusion model. The EAAF module adaptively adjusts the fusion behavior between content and style features according to their alignment status. Furthermore, a Progressive Feature Reweighting (PFR) strategy is introduced to further regulate the balance between content and style information during the denoising process. While EAAF focuses on adaptive feature fusion, PFR adjusts feature importance across different denoising stages. Through these designs, the proposed framework aims to preserve semantic consistency while enhancing style expressiveness and reducing the negative effects of content leakage on generation quality.
3.2. Entropy-Aware Adaptive Fusion (EAAF)
To address the imbalance between content and style representations in text-driven style transfer, this study proposes the Entropy-Aware Adaptive Fusion (EAAF) module. Unlike conventional one-way feature alignment methods, the proposed module introduces bidirectional feature distribution transformation, cross-attention feature fusion, entropy-aware dynamic weighting, similarity suppression and feature composition to achieve more adaptive content-style balancing and reduce content leakage.
3.2.1. Bidirectional Feature Distribution Transformation
Conventional feature distribution alignment methods are often based on AdaIN [4], where content features are transformed to match the statistical distribution of style features. Although this operation allows content features to inherit style-related characteristics while preserving structural information, it only considers one-way distribution alignment from content to style features. Consequently, inconsistencies between the two feature domains may still remain during subsequent fusion stages.
To improve feature compatibility, this study introduces a bidirectional feature distribution transformation mechanism. In addition to transforming content features toward the style distribution, the proposed method also transforms style features toward the content distribution to establish a more symmetric fusion foundation. The transformation process is defined as follows:
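$$\hat{F}_c = \sigma(F_s)\,\frac{F_c - \mu(F_c)}{\sigma(F_c)} + \mu(F_s) \tag{1}$$
$$\hat{F}_s = \sigma(F_c)\,\frac{F_s - \mu(F_s)}{\sigma(F_s)} + \mu(F_c) \tag{2}$$
where $F_c$ and $F_s$ denote content and style features, respectively, while $\mu(\cdot)$ and $\sigma(\cdot)$ represent the channel-wise mean and standard deviation. $\hat{F}_c$ denotes the transformed content features aligned to the style distribution, whereas $\hat{F}_s$ denotes the transformed style features aligned to the content distribution.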
Compared with conventional one-way alignment strategies, the proposed bidirectional transformation improves distribution consistency before feature fusion and enhances fusion stability in subsequent stages. By aligning both content and style representations at the distribution level, the proposed framework establishes a more balanced feature space for subsequent cross-attention fusion.
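The bidirectional transformation described above can be illustrated with a minimal pure-Python sketch. It assumes each feature is a list of channels and each channel is a flat list of values; the helper names (`channel_stats`, `align`, `bidirectional_align`) and the stabilizing constant `eps` are illustrative assumptions, not part of the original method description.

```python
import math

def channel_stats(feat, eps=1e-5):
    """Mean and standard deviation of one channel (a flat list of floats)."""
    mean = sum(feat) / len(feat)
    var = sum((v - mean) ** 2 for v in feat) / len(feat)
    return mean, math.sqrt(var + eps)  # eps avoids division by zero (assumed)

def align(source, target, eps=1e-5):
    """AdaIN-style alignment: give `source` the mean/std of `target`."""
    mu_s, sd_s = channel_stats(source, eps)
    mu_t, sd_t = channel_stats(target, eps)
    return [sd_t * (v - mu_s) / sd_s + mu_t for v in source]

def bidirectional_align(content, style):
    """Apply the alignment in both directions, channel by channel."""
    content_hat = [align(c, s) for c, s in zip(content, style)]
    style_hat = [align(s, c) for c, s in zip(content, style)]
    return content_hat, style_hat

# Toy example with a single channel per feature.
content = [[0.0, 1.0, 2.0, 3.0]]
style = [[10.0, 10.0, 14.0, 14.0]]
content_hat, style_hat = bidirectional_align(content, style)
```

After the call, the transformed content channel carries the style channel's mean, and the transformed style channel carries the content channel's mean, while the relative ordering of values within each channel is preserved.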
3.2.2. Cross-Attention Feature Fusion
After bidirectional feature distribution transformation, cross-attention is employed to fuse content and style representations. The core idea of cross-attention is to establish correlations between different features through similarity estimation among Query, Key, and Value representations. Compared with direct concatenation or weighted averaging, cross-attention provides more flexible and adaptive feature integration.
In the proposed framework, the transformed style feature $\hat{F}_s$ is used as the Query $Q$, while the transformed content feature $\hat{F}_c$ is used as both the Key $K$ and the Value $V$. The cross-attention process is formulated as follows:
$$S = \frac{Q K^{\top}}{\sqrt{d}} \tag{3}$$
$$A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{N} \exp(S_{ik})} \tag{4}$$
$$F_{attn} = A V \tag{5}$$
where $d$ denotes the feature dimension, $N$ represents the feature sequence length, $S$ represents the similarity matrix, $S_{ij}$ represents the similarity between the $i$-th style feature and the $j$-th content feature, $A$ denotes the attention map after softmax normalization, $A_{ij}$ represents the attention weight between the corresponding style and content features, and $F_{attn}$ represents the fused attention feature. Equation (3) computes the similarity between style and content features, while Equation (4) converts the similarity matrix into attention weights through softmax normalization. Equation (5) then aggregates content features according to the attention distribution.
The proposed design uses style representations as queries and content representations as keys and values. Since the primary objective of this task is style transfer, style features serve as the main guidance during feature fusion, while structural and semantic information from content features is preserved. Consequently, the fused features can simultaneously maintain prompt consistency and style expressiveness. In addition, because the content and style features have already been aligned through bidirectional distribution transformation, the subsequent cross-attention process becomes more stable and effective.
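As a rough illustration of this query/key/value arrangement, the sketch below implements scaled dot-product cross-attention on nested Python lists. It assumes no learned projection matrices and a scaling by the square root of the feature dimension; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, key, value):
    """Scaled dot-product attention: rows of `query` attend to rows of `key`."""
    d = len(query[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in key] for q_row in query]
    attn = [softmax(row) for row in scores]
    # Each output row is an attention-weighted sum of the value rows.
    fused = [[sum(a * v_row[c] for a, v_row in zip(a_row, value))
              for c in range(len(value[0]))] for a_row in attn]
    return attn, fused

# Style tokens act as queries; content tokens act as keys and values.
style_hat = [[1.0, 0.0], [0.0, 1.0]]
content_hat = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attn, fused = cross_attention(style_hat, content_hat, content_hat)
```

Each row of `attn` is a probability distribution over content tokens, and `fused` has one row per style token, built entirely from content values.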
3.2.3. Entropy-Aware Dynamic Weighting and Similarity Suppression
After cross-attention fusion, the model obtains the fused attention feature. However, directly using this feature may limit the model's ability to dynamically regulate the relative influence of textual semantics and style representations under different prompt-style combinations. Since different prompt-style pairs may exhibit different levels of semantic alignment confidence, fixed feature fusion strategies may still lead to style overfitting, insufficient stylization, or interference with content structures. Therefore, this study introduces entropy as the basis for adaptive weighting in subsequent feature composition.
Since the attention weight matrix $A$ after softmax normalization can be regarded as a probability distribution, this study uses entropy to measure the concentration of the attention distribution, which is defined as follows:
$$H_i = -\sum_{j=1}^{N} A_{ij} \log\left(A_{ij} + \epsilon\right) \tag{6}$$
where $H_i$ denotes the entropy value of the $i$-th attention position, $N$ represents the feature sequence length, and $\epsilon$ is a small constant set to $1 \times 10^{-5}$ for numerical stability. Lower entropy values $H_i$ indicate that the attention distribution is concentrated on more reliable feature correspondences, implying more reliable semantic alignment, whereas higher entropy values $H_i$ indicate greater uncertainty in feature correspondences, implying weaker confidence in semantic alignment.
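The behavior of the entropy measure can be checked with a small sketch that computes a Shannon entropy per attention row, using the stabilizing constant 1e-5 mentioned above. The function name `attention_entropy` and the two example rows are illustrative assumptions.

```python
import math

def attention_entropy(attn, eps=1e-5):
    """Shannon entropy of each attention row; eps guards against log(0)."""
    return [-sum(p * math.log(p + eps) for p in row) for row in attn]

peaked = [0.97, 0.01, 0.01, 0.01]    # attention focused on one position
uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
h_peaked, h_uniform = attention_entropy([peaked, uniform])
```

The peaked row yields a lower entropy than the uniform row. For a single row of length N, the value is at most about ln N, so the attainable range grows with the sequence length.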
Based on the estimated entropy, a dynamic weighting factor $\alpha$ is then computed (Equation (7)) from the entropy value $H$ of the attention weights. An exponential decay function is adopted to map entropy values into a bounded weighting factor, enabling $\alpha$ to adjust adaptively according to the attention uncertainty. Since the theoretical upper bound of the entropy $H$ depends on the feature dimensions and can be extremely large, its exact numerical range cannot be predetermined analytically. Therefore, we observed the empirical distribution of $H$ during training. The scaling coefficient 0.1 was introduced as a temperature-like factor to prevent exponential saturation. It maps the empirical range of $H$ into a sensitive region of the exponential function, ensuring that $\alpha$ transitions smoothly between 0.3 and 0.8 rather than collapsing prematurely to the boundaries.
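The exact form of the weighting function is not reproduced here, so the following is only a sketch under stated assumptions: the entropy is reduced to a single scalar, the decay is exp(-0.1 * H), and the result is clamped to the interval [0.3, 0.8]. The clamp and the name `dynamic_weight` are assumptions chosen to match the description above (exponential decay, scaling coefficient 0.1, range 0.3 to 0.8); the authors' formula may differ, for instance by using an affine rescaling instead of a clamp.

```python
import math

def dynamic_weight(entropy, scale=0.1, lo=0.3, hi=0.8):
    """Map a scalar entropy to a bounded weight via exponential decay.

    ASSUMPTION: clamp(exp(-scale * entropy), lo, hi) is one plausible
    reading of the description, not the confirmed formula.
    """
    raw = math.exp(-scale * entropy)
    return min(hi, max(lo, raw))

alpha_confident = dynamic_weight(3.0)   # lower entropy -> larger weight
alpha_uncertain = dynamic_weight(6.0)   # higher entropy -> smaller weight
```

Under this reading, the weight decreases monotonically with entropy and never leaves the stated range, which is the qualitative behavior the text describes.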
In addition to the entropy-aware weighting factor $\alpha$, a similarity suppression term (Equation (8)) is introduced to prevent the fused attention feature from becoming overly similar to the transformed style feature. The resulting similarity suppression coefficient is computed using cosine similarity. When the fused attention feature becomes excessively similar to the transformed style feature, the suppression coefficient decreases accordingly to reduce the dominance of style information. This mechanism helps alleviate style overfitting and content leakage during feature fusion. The role of the dynamic weighting factor $\alpha$ and the similarity suppression coefficient in the final feature composition process is further described in Section 3.2.4.
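One way to picture the suppression term is the sketch below, which assumes the coefficient equals one minus the non-negative part of the cosine similarity between the fused attention feature and the transformed style feature. This specific form, the function names, and the small constant `eps` are assumptions for illustration; the text only states that the coefficient is based on cosine similarity and decreases as similarity grows.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def suppression_coefficient(fused, style_hat):
    """ASSUMED form: 1 - max(0, cos); high similarity gives a small coefficient."""
    return 1.0 - max(0.0, cosine_similarity(fused, style_hat))

beta_same = suppression_coefficient([1.0, 0.0], [1.0, 0.0])        # near 0
beta_orthogonal = suppression_coefficient([1.0, 0.0], [0.0, 1.0])  # 1.0
```

A fused feature that nearly copies the style feature is almost fully suppressed, while an unrelated one passes through unchanged.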
3.2.4. Feature Composition
After obtaining the entropy-aware weighting factor $\alpha$ and the similarity suppression coefficient, the proposed framework performs final feature composition (Equation (9)) to generate the output feature of the EAAF module. The composition combines the content feature with the fused attention feature: $\alpha$ is the dynamic weight assigned to the fused attention feature, the suppression coefficient reduces excessive style dominance, and the result is the final output feature of the EAAF module.
According to Equation (9), the weighting factor $\alpha$ adaptively adjusts the balance between the content feature and the fused attention feature. Larger values of $\alpha$ indicate that the model has more confidently captured the semantic content of the text prompt, allowing a greater contribution from the fused attention feature and preserving more style-related information. Conversely, smaller values of $\alpha$ increase the contribution of content features to avoid premature interference from style information when semantic uncertainty remains high.
Meanwhile, the suppression coefficient directly regulates the contribution of the fused attention feature according to its similarity to the transformed style feature. When the similarity is high, the coefficient becomes smaller, reducing the influence of the fused attention feature and preventing excessive style dominance, content leakage, and style overfitting. Conversely, when the similarity is lower, the coefficient becomes larger, allowing more fused style information to be preserved and enhancing stylistic expressiveness.
Through the joint effects of entropy-aware weighting and similarity suppression, the proposed feature composition mechanism dynamically balances content preservation and style expressiveness under different prompt-style combinations. Consequently, the EAAF module improves semantic consistency, stylization quality, and generation stability while mitigating content leakage and style overfitting.
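The interplay of the two coefficients can be sketched as a convex-style blend. The form used below, (1 - alpha) * content + alpha * beta * fused, is an assumption consistent with the qualitative description (smaller alpha favors content; smaller suppression coefficient damps the fused feature); it is not confirmed as the authors' exact composition rule, and `compose` and `beta` are hypothetical names.

```python
def compose(content, fused, alpha, beta):
    """ASSUMED composition: blend content with the suppressed fused feature."""
    return [(1.0 - alpha) * c + alpha * beta * f
            for c, f in zip(content, fused)]

content = [1.0, 1.0]
fused = [3.0, 3.0]
balanced = compose(content, fused, alpha=0.5, beta=1.0)    # equal blend
suppressed = compose(content, fused, alpha=0.5, beta=0.0)  # fused term removed
content_only = compose(content, fused, alpha=0.0, beta=1.0)
```

In this sketch, lowering either coefficient reduces the share of style-guided information in the output, which mirrors the two control paths described above.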
3.3. Progressive Feature Reweighting (PFR)
In addition to adaptive feature fusion within individual denoising steps, this study further observes that the importance of textual semantics and style representations varies across different stages of the diffusion process. In some cases, style representations gradually weaken during later denoising stages, leading to reduced color saturation, weakened stylization, or visually grayish results in the final generated images. To address this issue, this study proposes a Progressive Feature Reweighting (PFR) strategy to dynamically regulate the influence of content and style features throughout the diffusion process.
The proposed method assumes that different denoising stages focus on different generation objectives. During early denoising stages, the diffusion model primarily establishes global structure and semantic layout; therefore, stronger textual guidance is required to maintain semantic consistency. In contrast, during middle and later stages, the overall content structure has already been formed, making these stages more suitable for strengthening style-related information such as colors, textures, and artistic patterns.
Based on this observation, the proposed PFR strategy progressively adjusts the relative contribution of content and style features according to the current denoising stage. During early stages, the framework increases the influence of content features to stabilize semantic structures. During middle stages, the framework gradually enhances style features to compensate for the attenuation of stylization effects. In the final denoising stages, no additional reweighting is applied to avoid over-enhancement and generation instability. The overall procedure of the proposed PFR strategy is summarized in Algorithm 1.
Algorithm 1: Progressive Feature Reweighting (PFR)
Input: fused feature, content feature, style feature, reweighting factor, two stage thresholds, total diffusion steps
Output: reweighted feature
for each diffusion step do
    if the step lies in the early stage (first threshold) then
        enhance content feature
    else if the step lies in the middle stage (second threshold) then
        enhance style feature
    else
        keep original fused feature
    end if
end for
return reweighted feature
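The stage logic of Algorithm 1 can be sketched as follows. Several details are assumptions made only for illustration: steps are indexed from 0 (noisiest) upward, the thresholds are expressed as fractions of the total step count, enhancement is modeled as adding a scaled copy of the relevant feature, and the default values of `gamma`, `early_end`, and `mid_end` are placeholders rather than the settings used in the experiments.

```python
def pfr_step(fused, content, style, step, total_steps,
             gamma=0.5, early_end=0.3, mid_end=0.7):
    """Reweight the fused feature according to the denoising stage.

    ASSUMPTIONS: additive enhancement and fractional thresholds are
    illustrative choices; all default values are placeholders.
    """
    progress = step / total_steps
    if progress < early_end:
        # Early stage: strengthen content to stabilize semantic structure.
        return [f + gamma * c for f, c in zip(fused, content)]
    if progress < mid_end:
        # Middle stage: strengthen style to offset fading stylization.
        return [f + gamma * s for f, s in zip(fused, style)]
    # Final stage: leave the fused feature unchanged.
    return list(fused)

early = pfr_step([1.0], [10.0], [100.0], step=0, total_steps=50)
middle = pfr_step([1.0], [10.0], [100.0], step=25, total_steps=50)
late = pfr_step([1.0], [10.0], [100.0], step=45, total_steps=50)
```

With 50 steps, this toy schedule boosts content for roughly the first 15 steps, boosts style for the next 20, and applies no reweighting afterwards.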
Unlike EAAF, which focuses on adaptive feature fusion within individual denoising steps, PFR regulates feature importance across the entire diffusion process. By progressively adjusting the influence of content and style information across different denoising stages, the proposed strategy improves semantic stability while preserving stylization quality during later generation stages. As a result, the proposed framework achieves better prompt alignment, stronger style representations, and more stable visual quality in the final generated images.
4. Experiment
4.1. Experimental Setup
Experiments in this study were conducted on a single NVIDIA GeForce RTX 4090 GPU. To evaluate the modularity and generalization capability of the proposed framework, the proposed method was integrated into four existing text-driven style transfer methods: StyleStudio [17], CSGO [16], InstantStyle [15], and StyleCrafter [20]. During generation, the input style images of all compared methods were resized to 512 × 512, while the generated image resolution was set to 1024 × 1024. The number of denoising steps was fixed at 50 for all experiments. Other generation-related parameters associated with style control followed the original settings of each baseline method to ensure fair comparisons under their default generation conditions. For the proposed Progressive Feature Reweighting (PFR) strategy, the final experiments adopted fixed values of the reweighting factor and the two stage thresholds. The influence of different PFR parameter settings on generation performance is further analyzed in Section 4.7.4.
Since the proposed EAAF module and PFR strategy are specifically designed as training-free, plug-and-play enhancement modules for diffusion-based text-driven style transfer frameworks, the experimental evaluation focuses on diffusion-based methods that can directly incorporate the proposed modules without retraining or fine-tuning. Conventional non-diffusion style transfer methods differ substantially in generation mechanisms and evaluation protocols, making direct quantitative comparisons difficult under a unified experimental setting. The proposed framework is evaluated using multiple criteria, including CLIP Text Similarity, CLIP Style Similarity, inference time, qualitative visual analysis, and user preference studies, to comprehensively assess prompt alignment, style preservation, computational efficiency, and perceptual quality.
4.2. Dataset
The dataset settings in this study were mainly based on StyleStudio [17]. For the style image settings, this study adopted StyleBench, which was proposed in StyleShot [32]. StyleBench contains 73 different style categories covering diverse visual styles, including paintings, illustrations, 3D renderings, sculptures, and material textures. Considering the computational cost and experimental time required for image generation, this study randomly selected 20 styles from the 73 style categories, with one representative image chosen from each style category. Figure 2 shows the style images used in the quantitative experiments.
For the text prompt settings, this study used a total of 82 prompts for image generation, including 52 prompts obtained from StyleAdapter [31] and 30 prompts generated by ChatGPT-4o using the format “A <color> <object>.” The prompts provided by StyleAdapter [31] contain general object and scene descriptions, which were used to evaluate the generation capability of the model under different semantic conditions. In contrast, the prompts in the form of “A <color> <object>.” were designed to examine whether the model could correctly generate the specified colors and object content. Figure 3 presents the complete list of text prompts used in this study.
In this study, the 20 style images were combined with the 82 text prompts, and one image was generated for each style image and text prompt pair. Consequently, each method produced a total of 1640 generated images, which were subsequently used for quantitative evaluations and qualitative visual analyses.
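The evaluation grid can be enumerated with a short sketch. The identifiers below (`style_00`, `prompt_00`, and so on) are placeholder names, not the actual file names or prompts used in the study.

```python
from itertools import product

# Placeholder identifiers for the 20 style images and 82 text prompts.
style_ids = [f"style_{i:02d}" for i in range(20)]
prompt_ids = [f"prompt_{i:02d}" for i in range(82)]

# One generated image per (style, prompt) pair.
pairs = list(product(style_ids, prompt_ids))
images_per_method = len(pairs)
```

Pairing every style with every prompt gives 20 × 82 = 1640 images per method, matching the count stated above.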
4.3. Evaluation Metrics
To evaluate the effectiveness of the proposed method in text-driven style transfer, this study adopted two commonly used CLIP-based evaluation metrics: CLIP Text Similarity and CLIP Style Similarity. CLIP Text Similarity was used to evaluate prompt alignment between generated images and text prompts, while CLIP Style Similarity was used to evaluate stylistic similarity between generated images and reference style images.
However, recent studies have pointed out that CLIP Style Similarity does not always accurately reflect actual style preservation quality [17,32,35]. Since CLIP primarily focuses on high-level semantic features, the metric may also be influenced by semantic similarity between images rather than purely stylistic characteristics. Consequently, higher CLIP Style Similarity values may sometimes result from semantic content leakage from the style image instead of better style transfer performance. Therefore, a qualitative visual analysis and user study were also conducted to further evaluate the overall generation quality under different text prompts and style conditions.
4.4. Quantitative Results
This section presents quantitative comparisons of different text-driven style transfer methods before and after integrating the proposed EAAF module and PFR strategy. The quantitative results are summarized in Table 1, where CLIP-Text evaluates semantic alignment between generated images and text prompts, CLIP-Style measures stylistic similarity between generated images and reference style images, and Inference Time represents the average generation time per image. To verify the modularity of the proposed framework, EAAF and PFR were integrated into StyleCrafter [20], InstantStyle [15], CSGO [16], and StyleStudio [17], and compared with their original implementations.
As shown in Table 1, integrating EAAF consistently improved CLIP-Text scores across all four methods, indicating enhanced semantic alignment between generated images and text prompts. Among the compared methods, StyleCrafter [20] achieved the largest improvement, with the CLIP-Text score increasing from 0.1896 to 0.2549. In contrast, CLIP-Style scores decreased after integrating EAAF into all compared methods, indicating a trade-off between prompt alignment and stylistic similarity when stronger emphasis is placed on textual semantics. However, as discussed in Section 4.3, CLIP-Style may also be influenced by semantic content and image composition. Therefore, lower CLIP-Style scores do not necessarily indicate poorer style preservation from a human perceptual perspective. The results further show that integrating both EAAF and PFR improves CLIP-Style scores compared with using EAAF alone while still maintaining higher CLIP-Text scores than the original methods. In particular, for StyleCrafter [20], the CLIP-Style score increased from 0.5922 to 0.6656 after integrating PFR, indicating that PFR can effectively reinforce style representations during later denoising stages.
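To put the StyleCrafter figures quoted above in perspective, the short calculation below derives the absolute and relative changes from the four reported scores. Only those four values come from the text; the helper name `improvement` is illustrative.

```python
def improvement(before, after):
    """Return the absolute and relative change between two scores."""
    delta = after - before
    return delta, delta / before

# StyleCrafter scores quoted in the text.
text_delta, text_rel = improvement(0.1896, 0.2549)    # CLIP-Text, before/after EAAF
style_delta, style_rel = improvement(0.5922, 0.6656)  # CLIP-Style, before/after PFR

print(f"CLIP-Text:  +{text_delta:.4f} ({text_rel:.1%})")
print(f"CLIP-Style: +{style_delta:.4f} ({style_rel:.1%})")
```

This corresponds to a relative gain of roughly 34% in CLIP-Text after integrating EAAF and roughly 12% in CLIP-Style after adding PFR.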
In terms of generation efficiency, integrating EAAF and PFR only resulted in slight increases in inference time across all methods. Since both modules operate during feature fusion and feature reweighting stages without substantially modifying the original architecture, the proposed framework introduces limited computational overhead while improving overall generation performance. Overall, the experimental results demonstrate that EAAF improves semantic alignment between generated images and text prompts, while PFR further compensates for the reduction in stylistic similarity caused by EAAF. Although the resulting CLIP-Style scores remain lower than those of the original methods, further qualitative analysis and user studies are presented in the following sections to provide a more comprehensive evaluation of style preservation, content leakage, and overall generation quality.
4.5. Qualitative Results
In addition to quantitative evaluations, this study further compares different text-driven style transfer methods through qualitative visual analysis. Figure 4 presents qualitative comparison results under different combinations of style images and text prompts. The proposed EAAF module and PFR strategy were integrated into different baseline methods, including StyleCrafter [20], InstantStyle [15], CSGO [16], and StyleStudio [17]. Representative examples are presented to illustrate the visual differences before and after integrating the proposed modules into each baseline method.
As shown in Figure 4, different methods exhibit varying abilities in balancing textual semantics and style representations. StyleCrafter [20] is more susceptible to the semantic content of the reference style image, resulting in content leakage. This issue can be observed in the gray hat, snowy mountain peak, and orange backpack examples, where semantic structures from the reference style images are partially preserved in the generated results. InstantStyle [15] generally preserves stylistic characteristics but exhibits weaker prompt alignment in certain cases. This issue can be observed in the gray hat, red dress, orange backpack, and red umbrella examples, where the generated results fail to accurately reflect the object colors specified in the text prompts, and the target objects are sometimes not clearly represented.
CSGO [16] and StyleStudio [17] may occasionally suffer from style overfitting or unstable generation quality. In the umbrella example, both methods exhibit style overfitting, where excessive structural patterns from the reference style image are preserved, resulting in reduced semantic consistency with the text prompt. In addition, facial distortions can be observed in some portrait examples.
After integrating the proposed EAAF and PFR modules, the generated results more consistently reflect the content specified by the text prompts while preserving stylistic characteristics from the reference images. The proposed method effectively alleviates content leakage and style overfitting while improving prompt alignment. As a result, the generated images exhibit a better balance between prompt alignment and style preservation.
In addition to the qualitative comparisons in Figure 4, Figure 5 further analyzes the limitations of CLIP-Style for evaluating stylistic similarity. StyleStudio [17] pointed out that existing style similarity metrics may favor generated images with higher semantic similarity to the style image rather than truly better stylization quality. Figure 5 presents several generated examples together with their corresponding CLIP-Style scores to illustrate that higher CLIP-Style values do not always correspond to better perceptual style similarity. Some methods achieve higher CLIP-Style scores because the generated images preserve semantic content, object structures, or scene compositions similar to those in the style image, making them closer in the CLIP feature space. However, from a human visual perspective, such results do not necessarily represent better style transfer quality and may instead indicate content leakage. In contrast, some generated images with lower CLIP-Style scores can still preserve stylistic characteristics such as colors, brushstrokes, textures, and overall visual atmosphere. Therefore, CLIP-Style should not be used as the sole indicator for evaluating style preservation. Instead, qualitative visual analysis and user studies are necessary to provide a more comprehensive evaluation of text-driven style transfer quality.
4.6. User Study
In addition to the quantitative and qualitative evaluations, a user study was conducted to further assess the performance of different methods in text-driven style transfer. Since stylized image quality involves subjective factors such as semantic consistency, style preservation, and overall visual perception, human evaluation was employed to complement the objective metrics.
A total of 40 prompt–style image pairs were used for evaluation. To ensure equal contribution from different styles, two prompt–style combinations were randomly selected from each of the 20 styles used in the experimental dataset. A total of 137 valid responses were collected from undergraduate and graduate students majoring in computer science. All participants had completed at least one computer vision-related course and possessed fundamental knowledge of image processing and visual quality assessment.
During the evaluation, participants were provided with the input text prompt, style reference image, and five candidate results generated by StyleCrafter [20], InstantStyle [15], CSGO [16], StyleStudio [17], and Ours. An example of the user study interface is shown in Figure 6. The result labeled as Ours was generated by applying the proposed modules to one randomly selected baseline method. To ensure a blind comparison and reduce subjective bias, the method names were hidden, and the image order was randomly shuffled. Three evaluation aspects were considered, including text similarity, style similarity, and overall quality. In total, 16,440 individual evaluations were collected. The user study results are summarized in Table 2.
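The blind, shuffled presentation and the total number of evaluations can be sketched as follows. The anonymized labels (`Image A`, `Image B`, ...), the fixed random seed, and the helper names are illustrative assumptions; the total assumes that every participant rated every prompt–style pair on all three aspects, which matches the reported count.

```python
import random

METHODS = ["StyleCrafter", "InstantStyle", "CSGO", "StyleStudio", "Ours"]
ASPECTS = ["text similarity", "style similarity", "overall quality"]

def blind_order(methods, rng):
    """Map anonymized labels to methods in a random order."""
    shuffled = methods[:]
    rng.shuffle(shuffled)
    return {f"Image {chr(ord('A') + i)}": m for i, m in enumerate(shuffled)}

def preference_rates(votes):
    """Share of votes received by each method (votes is a list of method names)."""
    total = len(votes)
    return {m: votes.count(m) / total for m in METHODS}

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
question = blind_order(METHODS, rng)  # hidden label -> method for one question

num_pairs, num_participants = 40, 137
total_evaluations = num_pairs * num_participants * len(ASPECTS)
print(total_evaluations)  # 16440
```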
For text similarity, the proposed method received the highest user preference rate, accounting for more than half of all votes. This result is consistent with the quantitative CLIP-Text analysis, indicating that the proposed EAAF and PFR modules effectively improve prompt alignment.
For style similarity, the proposed method also achieved the highest user preference rate. Notably, although StyleCrafter [20] obtained the highest CLIP-Style score in the quantitative evaluation, it received a relatively lower preference rate in the user study. This observation is consistent with the qualitative analysis presented in Section 4.5 and further suggests that CLIP-Style scores do not always align with human perception of style similarity. In contrast, although our proposed method did not achieve the highest CLIP-Style score, it received the highest user preference rate, indicating that our generated results better match human judgments of stylistic characteristics.
For overall quality, the proposed method received nearly half of all user preferences and outperformed the compared methods. Since this evaluation jointly considers content consistency and style preservation, the results suggest that the proposed framework achieves a better balance between prompt alignment and style preservation, leading to improved overall generation quality.
4.7. Ablation Study
In Sections 4.7.1–4.7.4, CSGO [16] was used as the representative baseline for parameter analysis to keep the computational cost manageable. After the parameter settings were determined, the same configuration, including the entropy scaling coefficient of 0.1, the dynamic weighting range of 0.3–0.8, and the PFR setting selected in Section 4.7.4, was directly applied to StyleCrafter [20], InstantStyle [15], and StyleStudio [17] without additional parameter tuning. The consistent improvements observed across the four baseline methods in Table 1 and Figure 4 indicate that the selected settings are not specific to CSGO and support the transferability of the proposed modules.
4.7.1. Analysis of the Exponential Coefficient in Dynamic Weighting
In Equation (7), the proposed entropy-aware dynamic weighting factor is used to regulate the fusion ratio between textual semantics and style representations. The exponential coefficient determines how this weighting factor varies with entropy and thus affects the balance between prompt alignment and style preservation. To investigate its effect, an ablation study was conducted using different exponential coefficient settings.
As shown in Figure 7, smaller exponential coefficients lead to a slower decay of the weighting factor, increasing the influence of style representations during feature fusion. Consequently, the generated images exhibit more pronounced stylistic characteristics, characterized by richer textures and higher color saturation. However, excessive style injection may also lead to style overfitting, resulting in repeated textures, structural distortions, and visual artifacts. In contrast, larger exponential coefficients cause the weighting factor to decrease more rapidly as entropy increases, encouraging the model to preserve textual semantics during feature fusion. This setting effectively alleviates style overfitting and visual artifacts. However, when the coefficient becomes excessively large, the influence of style representations is weakened, leading to reduced stylization quality despite improved prompt alignment.
Based on the qualitative comparisons in Figure 7, the exponential coefficient in Equation (7) was set to 0.1. Although some examples, such as the apple sample, exhibit stronger stylistic textures when the coefficient is set to 0.08, the coefficient of 0.1 provides more stable generation results while effectively reducing style overfitting and visual artifacts. The resulting reduction in style intensity is further addressed by the PFR strategy described in Section 3.3.
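The qualitative behavior of the exponential coefficient can be illustrated with a small numerical sketch. Equation (7) is not reproduced in this section, so the form used here, a weight of exp(-k·H) for entropy H and coefficient k, is an assumption chosen only because it decays faster for larger k; the toy probability distributions and function names are likewise hypothetical.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def style_weight(entropy, coefficient):
    """Assumed form (not Equation (7) itself): exponential decay with entropy."""
    return math.exp(-coefficient * entropy)

# Toy distributions: peaked (low entropy) and uniform (high entropy).
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

for k in (0.08, 0.1):
    for name, dist in (("peaked", peaked), ("uniform", uniform)):
        h = shannon_entropy(dist)
        print(f"k={k}: {name} entropy={h:.3f} -> weight={style_weight(h, k):.3f}")
```

Under this assumed form, the coefficient of 0.1 yields a smaller style weight than 0.08 at any positive entropy, which mirrors the weaker style injection described above.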
4.7.2. Analysis of the Dynamic Weighting Range
Although the entropy-aware dynamic weighting factor is adaptively estimated through Equation (7), its value directly controls the relative contribution of content and style features during feature fusion. Different weighting ranges may therefore significantly influence the balance between prompt alignment and style preservation. This section investigates the effect of different ranges through an ablation study.
As shown in Figure 8, lower values of the weighting factor encourage the model to focus more on textual semantics, resulting in clearer object structures and stronger prompt alignment. However, the influence of style representations becomes relatively weak, leading to less distinctive colors, textures, and brushstroke patterns. In particular, when the weighting factor is smaller than 0.3, insufficient style injection causes the generated images to rely excessively on content features, resulting in weaker stylization performance. In contrast, larger values of the weighting factor increase the influence of style representations, leading to stronger stylization effects. However, excessively large values may lead to style overfitting. When the weighting factor exceeds 0.8, the generated images become less stable and may exhibit repeated textures, structural distortions, or visual elements inconsistent with the text prompt.
According to the qualitative comparisons shown in Figure 8, weighting factor values between 0.3 and 0.8 provide a favorable balance between prompt alignment and style preservation. Therefore, the range of 0.3–0.8 is adopted as the operating range of the weighting factor in Equation (7).
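A minimal sketch of how an operating range of 0.3–0.8 could constrain the fusion is given below. Two points are assumptions rather than details taken from the method: the weight is restricted by simple clamping (Equation (7) may instead rescale it into the range), and the fusion is written as a convex blend in which the weight scales the style features and its complement scales the content features.

```python
def clamp_weight(raw_weight, low=0.3, high=0.8):
    """Restrict the dynamic weight to the adopted operating range."""
    return max(low, min(high, raw_weight))

def fuse(content_feat, style_feat, weight):
    """Assumed convex blend: weight scales style, (1 - weight) scales content."""
    return [(1.0 - weight) * c + weight * s
            for c, s in zip(content_feat, style_feat)]

# Toy two-channel features (hypothetical values).
content = [1.0, 0.0]
style = [0.0, 1.0]

for raw in (0.1, 0.5, 0.95):
    w = clamp_weight(raw)
    print(f"raw={raw} -> weight={w} -> fused={fuse(content, style, w)}")
```

With the range enforced, the style contribution never falls below 30% or rises above 80% of the blend, which corresponds to the two failure modes (weak stylization and style overfitting) discussed above.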
4.7.3. Analysis of the Similarity Suppression Mechanism
To evaluate the effect of the proposed similarity suppression mechanism, an ablation study was conducted by comparing the generated results with and without the similarity suppression coefficient.
As shown in Figure 9, generated results without the similarity suppression mechanism are more likely to contain visual content that is inconsistent with the text prompt. In the first example, the prompt requires a white plate; however, when the suppression coefficient is not applied, the generated image contains leaf-like textures and plant structures, causing the plate surface to exhibit visual content that does not correspond to the prompt. This observation suggests that the model may suffer from style overfitting in certain cases, where style features exert excessive influence on the generated results and consequently reduce the consistency between the generated content and the text prompt. In contrast, after incorporating the suppression coefficient, the generated image more accurately reflects the white plate described in the prompt while preserving stylistic characteristics such as colors and brushstroke patterns from the reference style image. These results indicate that the similarity suppression mechanism effectively mitigates style overfitting while improving prompt alignment without substantially sacrificing style preservation.
4.7.4. Analysis of PFR Parameters
To further investigate the influence of different parameter settings in the proposed PFR strategy, an ablation study was conducted using various stage configurations and weighting strengths. The quantitative results are summarized in Table 3, which lists, for each configuration, the stage ranges used for content and style feature reweighting and the weighting strength.
As shown in Table 3, configurations that apply style reweighting before content reweighting generally achieve higher CLIP-Style scores but lower CLIP-Text scores than those that apply content reweighting first. This observation is consistent with the diffusion denoising process, where the early denoising stages mainly establish the global layout, object structure, and semantic consistency guided by the text prompt. Therefore, emphasizing style features too early may interfere with semantic structure formation and reduce prompt alignment. In contrast, configurations that first strengthen content features and subsequently enhance style features generally achieve better CLIP-Text performance, indicating that establishing a stable semantic structure before style enhancement is beneficial for prompt alignment. Further analysis of different style stage settings shows that extending the duration of style reweighting generally improves CLIP-Style scores but tends to reduce CLIP-Text performance. Since the later denoising stages mainly focus on local detail refinement, prolonged style enhancement may lead to excessive stylization and interfere with semantic consistency. Therefore, restoring the original weighting once the style reweighting stage ends provides a better balance between prompt alignment and style preservation.
Analysis of different weighting strengths shows that CLIP-Text gradually improves as the weighting strength increases, while CLIP-Style exhibits a decreasing trend. From a mechanism perspective, the weighting strength controls the magnitude of the feature reweighting process. Smaller values provide insufficient feature reweighting, limiting the effectiveness of the PFR strategy, whereas excessively large values may over-amplify the reweighting effect and disrupt the balance between semantic consistency and style preservation. Although larger values achieve slightly higher CLIP-Text scores, they also result in progressively lower CLIP-Style scores. Therefore, the selected configuration (see Table 3) achieves a favorable trade-off between prompt alignment and style preservation and is adopted as the final parameter setting of the proposed PFR strategy.
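The staged behavior discussed above (content emphasis first, style emphasis next, and the original weighting afterwards) can be sketched as a simple schedule. All numeric settings below are hypothetical and are not the values reported in Table 3, and the additive `1 + strength` scaling is an assumption about how the weighting strength enters; the sketch illustrates only the staging and does not attempt to reproduce the reported score trends.

```python
def pfr_scales(progress, content_range, style_range, strength):
    """Return (content_scale, style_scale) for a denoising progress in [0, 1).

    Inside a stage range the corresponding feature is up-weighted by
    `strength`; outside both ranges the original weighting (1.0) is used.
    """
    in_content_stage = content_range[0] <= progress < content_range[1]
    in_style_stage = style_range[0] <= progress < style_range[1]
    content_scale = 1.0 + strength if in_content_stage else 1.0
    style_scale = 1.0 + strength if in_style_stage else 1.0
    return content_scale, style_scale

# Hypothetical settings (NOT the values from Table 3).
CONTENT_RANGE = (0.0, 0.3)  # early steps: strengthen content features
STYLE_RANGE = (0.3, 0.7)    # middle steps: strengthen style features
STRENGTH = 0.2

num_steps = 10
for step in range(num_steps):
    progress = step / num_steps  # 0 = start of denoising
    print(step, pfr_scales(progress, CONTENT_RANGE, STYLE_RANGE, STRENGTH))
```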
4.7.5. Analysis of the PFR Strategy
Although EAAF helps alleviate style overfitting, style representations may still gradually weaken during later denoising stages in certain prompt-style combinations, resulting in reduced color saturation and weaker stylistic characteristics. To address this issue, the proposed PFR strategy is further introduced to compensate for the gradual attenuation of style representations during the diffusion process. Figure 10 presents a qualitative comparison between results generated using EAAF alone and those generated using EAAF together with PFR.
As shown in Figure 10, images generated using only EAAF generally maintain good prompt alignment but may still suffer from insufficient stylistic expression in certain cases. For example, in the umbrella example generated by CSGO, the image produced with EAAF successfully preserves the umbrella specified in the text prompt; however, the overall appearance is relatively desaturated and fails to fully retain the vivid colors and brushstroke characteristics of the reference style image. After incorporating PFR, the generated image exhibits noticeably improved color saturation, while stylistic attributes such as colors, textures, and brushstroke patterns become more prominent. At the same time, the generated content remains consistent with the text prompt. Similar improvements can be observed in the remaining examples, demonstrating that PFR enhances stylistic characteristics while maintaining prompt alignment.
4.7.6. Analysis of Different QKV Configurations
To validate the selected Query–Key–Value (QKV) assignment, we compared the proposed configuration (style as Query, content as Key/Value) with an alternative arrangement (content as Query, style as Key/Value). The quantitative results are summarized in Table 4, while qualitative results are shown in Figure 11.
As shown in Table 4, the alternative configuration achieves a higher CLIP-Text score but a lower CLIP-Style score than the proposed configuration, indicating a stronger emphasis on prompt semantics at the expense of style preservation. This observation is further supported by the qualitative results in Figure 11, where the alternative configuration places excessive emphasis on prompt semantics, preventing the model from effectively capturing the style from the reference image. In contrast, the proposed configuration achieves better style preservation while maintaining competitive prompt alignment. Since the primary objective of this work is text-driven style transfer, the Style-as-Query and Content-as-Key/Value configuration was adopted in the proposed framework.
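The structural difference between the two QKV assignments can be seen in a plain scaled dot-product attention sketch. This is a simplified illustration, not the implementation used in the framework: the learned Query/Key/Value projections are omitted, and the toy token counts and values are hypothetical (Figure 1 refers to 32 style tokens and 77 text tokens).

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, key, value):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out

# Toy tokens: 2 style tokens and 3 content tokens, 4 channels each.
style_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
content_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

proposed = attention(style_tokens, content_tokens, content_tokens)   # style as Query
alternative = attention(content_tokens, style_tokens, style_tokens)  # content as Query

print(len(proposed), len(alternative))  # 2 3
```

The output always contains one token per Query token, and each output token is a weighted mixture of the Value tokens, so swapping the roles changes both the length of the fused sequence and which features are being mixed.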
5. Conclusions
This study addressed the challenge of balancing textual semantics and style representations in text-driven style transfer tasks by proposing a framework that combines the Entropy-Aware Adaptive Fusion (EAAF) module and the Progressive Feature Reweighting (PFR) strategy. Experimental results demonstrate that the proposed framework effectively improves prompt alignment while maintaining strong stylistic characteristics. Furthermore, both qualitative analysis and user study results confirm that the proposed method achieves a better balance between content preservation and style expression than existing methods. The main contributions of this study are summarized as follows:
- An Entropy-Aware Adaptive Fusion (EAAF) module was proposed to improve feature fusion between textual semantics and style representations. By introducing entropy-aware dynamic weighting and similarity-aware suppression mechanisms, the proposed module adaptively adjusts the influence of content and style features under different prompt-style combinations. Experimental results show that EAAF effectively improves prompt alignment while helping mitigate content leakage and style overfitting, leading to a better balance between content preservation and style representation.
- A Progressive Feature Reweighting (PFR) strategy was proposed to alleviate the gradual weakening of style representations during later denoising stages. By progressively adjusting the relative importance of content and style features throughout the denoising process, the proposed strategy enhanced stylistic characteristics while preserving prompt consistency. The ablation study further demonstrated the effectiveness of the proposed strategy in improving style expression under different prompt-style combinations.
- The modularity and extensibility of the proposed framework were demonstrated through integration with multiple text-driven style transfer methods. Experimental results show that both EAAF and PFR can be incorporated into different diffusion-based frameworks with only limited computational overhead, demonstrating their potential for integration into a wide range of text-driven image generation frameworks.
While our framework serves as an effective plug-and-play module, existing diffusion-based style transfer methods still face several challenges, such as insufficient expression of extreme artistic styles and weakened semantic alignment for complex long texts. To advance this field, our future work will focus on three key directions. First, we plan to explore higher-order feature distribution alignment to better capture extreme style textures. Second, we aim to develop a dynamic PFR threshold mechanism to replace the current fixed-weight setup for more adaptive temporal control. Lastly, we will investigate multimodal joint optimization strategies to further enhance cross-modal representation and alleviate complex text alignment issues.
Author Contributions
Conceptualization, Y.-F.L., C.-H.C., and C.-L.L.; methodology, Y.-F.L., C.-H.C., and C.-L.L.; software, Y.-F.L.; validation, Y.-F.L. and C.-H.C.; formal analysis, Y.-F.L. and C.-C.L.; investigation, Y.-F.L. and C.-C.L.; resources, C.-H.C.; data curation, Y.-F.L.; writing—original draft preparation, Y.-F.L. and C.-H.C.; writing—review and editing, C.-H.C. and K.-C.F.; visualization, Y.-F.L. and C.-H.C.; supervision, C.-H.C. and K.-C.F.; project administration, C.-C.L., C.-H.C., and K.-C.F. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The datasets used in this study are publicly available, and the relevant sources are cited in the manuscript.
Acknowledgments
The authors would like to thank the reviewers and editors for their valuable comments and suggestions.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711.
- Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022.
- Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1501–1510.
- Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; Yang, M.-H. Universal style transfer via feature transforms. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 386–396.
- Li, X.; Liu, S.; Kautz, J.; Yang, M.-H. Learning linear transformations for fast image and video style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3809–3817.
- Chen, D.; Yuan, L.; Liao, J.; Yu, N.; Hua, G. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1897–1906.
- Sanakoyeu, A.; Kotovenko, D.; Lang, S.; Ommer, B. A style-aware content loss for real-time HD style transfer. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 698–714.
- Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. StyTr2: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11326–11336.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763.
- Kwon, G.; Ye, J.C. CLIPstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 18062–18071.
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2085–2094.
- Liu, Z.-S.; Wang, L.-W.; Siu, W.-C.; Kalogeiton, V. Name your style: Text-guided artistic style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 20–22 June 2023; pp. 3530–3534.
- Huang, N.; Zhang, Y.; Tang, F.; Ma, C.; Huang, H.; Dong, W.; Xu, C. DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 3370–3383.
- Wang, H.; Spinelli, M.; Wang, Q.; Bai, X.; Qin, Z.; Chen, A. InstantStyle: Free lunch towards style-preserving in text-to-image generation. arXiv 2024, arXiv:2404.02733.
- Xing, P.; Wang, H.; Sun, Y.; Wang, Q.; Bai, X.; Ai, H.; Huang, J.-Y.; Li, Z. CSGO: Content-Style Composition in Text-to-Image Generation. Adv. Neural Inf. Process. Syst. (NeurIPS) 2025, 38.
- Lei, M.; Song, X.; Zhu, B.; Wang, H.; Zhang, C. StyleStudio: Text-driven style transfer with selective control of style elements. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 11–15 June 2025; pp. 23443–23452.
- Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
- Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510.
- Liu, G.; Xia, M.; Zhang, Y.; Chen, H.; Xing, J.; Wang, Y.; Wang, X.; Yang, Y.; Shan, Y. StyleCrafter: Taming artistic video diffusion with reference-augmented adapter learning. ACM Trans. Graph. 2024, 43, 1–10.
- Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Huang, X.; Liu, M.-Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189.
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410.
- Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883.
- Tang, H.; Liu, S.; Lin, T.; Huang, S.; Li, F.; He, D.; Wang, X. Master: Meta style transformer for controllable zero-shot and few-shot artistic style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18329–18338.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 6840–6851.
- Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 8780–8794.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3836–3847.
- Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721.
- Wang, Z.; Wang, X.; Xie, L.; Qi, Z.; Shan, Y.; Wang, W.; Luo, P. StyleAdapter: A Unified Stylized Image Generation Model. Int. J. Comput. Vis. (IJCV) 2025, 133, 1894–1911.
- Gao, J.; Sun, Y.; Liu, Y.; Tang, Y.; Zeng, Y.; Qi, D.; Chen, K.; Zhao, C. StyleShot: A Snapshot on Any Style. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2025, 48, 1215–1228.
- Li, W.; Fang, M.; Zou, C.; Gong, B.; Zheng, R.; Wang, M.; Chen, J.; Yang, M. StyleTokenizer: Defining image style by a single instance for controlling diffusion models. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 110–126.
- Chen, D.-Y.; Tennent, H.; Hsu, C.-W. ArtAdapter: Text-to-Image Style Transfer Using Multi-Level Style Encoder and Explicit Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 8619–8628.
- Qi, T.; Fang, S.; Wu, Y.; Xie, H.; Liu, J.; Chen, L.; He, Q.; Zhang, Y. DEADiff: An efficient stylization diffusion model with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 8693–8702.
- Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, Y.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt image editing with cross attention control. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-Play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 1921–1930. [Google Scholar] [CrossRef]
Figure 1.
Overall architecture of the proposed framework. The dimension annotations denote [batch size × sequence length (tokens) × channel dimension]. Specifically, the 1 × 109 × 2048 feature is formed by concatenating 32 style tokens and 77 text tokens (109 = 32 + 77).
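The shape arithmetic in this caption can be checked with a minimal sketch. It assumes the two token sets share the batch size (1) and channel dimension (2048) and are joined along the sequence axis; the helper `concat_shape` and the style-first ordering are illustrative assumptions, not taken from the paper's implementation.

```python
def concat_shape(shape_a, shape_b, axis):
    """Return the shape produced by concatenating two arrays along `axis`."""
    if len(shape_a) != len(shape_b):
        raise ValueError("both shapes must have the same number of dimensions")
    for i, (a, b) in enumerate(zip(shape_a, shape_b)):
        # Every axis except the concatenation axis must match exactly.
        if i != axis and a != b:
            raise ValueError(f"size mismatch on axis {i}: {a} vs {b}")
    out = list(shape_a)
    out[axis] = shape_a[axis] + shape_b[axis]
    return tuple(out)


# [batch size, sequence length (tokens), channel dimension]
style_shape = (1, 32, 2048)  # 32 style tokens
text_shape = (1, 77, 2048)   # 77 text tokens

# Concatenate along the token axis (axis=1); assumed order: style, then text.
combined_shape = concat_shape(style_shape, text_shape, axis=1)
print(combined_shape)  # (1, 109, 2048)
```

Only the sequence axis grows; the batch and channel axes must already agree for the concatenation to be valid.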
Figure 2.
The 20 style images used in the quantitative experiments.
Figure 3.
The 82 text prompts used in the quantitative experiments.
Figure 4.
Qualitative comparison of baseline methods and their EAAF/PFR-integrated variants.
Figure 5.
Examples illustrating the limitations of CLIP Style Similarity. (Bold values indicate the highest CLIP-Style score, and underlined values indicate the second-highest score for each example.)
Figure 6.
Example of the user study interface. A–E denote the five candidate generated images used as voting options.
Figure 7.
Comparison of generated results under different exponential coefficients in the dynamic weighting function.
Figure 8.
Comparison of generated results under different weighting ranges.
Figure 9.
Qualitative comparison of the similarity suppression mechanism (β).
Figure 10.
Qualitative comparison of the effectiveness of the PFR strategy in alleviating style weakening.
Figure 11.
Qualitative comparison of different QKV configurations.
Table 1.
Quantitative comparison of different style transfer methods with EAAF and PFR. (↑ indicates that a higher value is better. Bold values indicate the best performance, and underlined values indicate the second-best performance within each baseline method.)
| Method | Setting | CLIP-Text ↑ | CLIP-Style ↑ | Inference Time (s) |
|---|---|---|---|---|
| StyleCrafter [20] | w/o EAAF & PFR | 0.1896 | 0.7429 | 7.10 |
| | w/ EAAF | 0.2549 | 0.5922 | 7.91 |
| | w/ EAAF & PFR | 0.2329 | 0.6656 | 7.92 |
| InstantStyle [15] | w/o EAAF & PFR | 0.2323 | 0.6651 | 6.41 |
| | w/ EAAF | 0.2557 | 0.5850 | 6.57 |
| | w/ EAAF & PFR | 0.2479 | 0.6022 | 6.60 |
| CSGO [16] | w/o EAAF & PFR | 0.2137 | 0.6356 | 6.45 |
| | w/ EAAF | 0.2409 | 0.5929 | 7.35 |
| | w/ EAAF & PFR | 0.2369 | 0.6066 | 7.36 |
| StyleStudio [17] | w/o EAAF & PFR | 0.2292 | 0.6107 | 15.44 |
| | w/ EAAF | 0.2465 | 0.5757 | 16.15 |
| | w/ EAAF & PFR | 0.2422 | 0.5877 | 16.18 |
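
As a quick arithmetic check of the trade-off reported in Table 1, the following sketch computes, for each baseline, the relative change in CLIP-Text and CLIP-Style and the added inference time when both EAAF and PFR are enabled. The numbers are copied from the table; the dictionary layout and the `relative_change` helper are illustrative names, not part of the paper.

```python
# (CLIP-Text, CLIP-Style, inference time in seconds), copied from Table 1.
TABLE_1 = {
    "StyleCrafter": {"baseline": (0.1896, 0.7429, 7.10), "eaaf_pfr": (0.2329, 0.6656, 7.92)},
    "InstantStyle": {"baseline": (0.2323, 0.6651, 6.41), "eaaf_pfr": (0.2479, 0.6022, 6.60)},
    "CSGO": {"baseline": (0.2137, 0.6356, 6.45), "eaaf_pfr": (0.2369, 0.6066, 7.36)},
    "StyleStudio": {"baseline": (0.2292, 0.6107, 15.44), "eaaf_pfr": (0.2422, 0.5877, 16.18)},
}


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


summary = {}
for method, rows in TABLE_1.items():
    base_text, base_style, base_time = rows["baseline"]
    full_text, full_style, full_time = rows["eaaf_pfr"]
    summary[method] = {
        "clip_text_pct": relative_change(base_text, full_text),
        "clip_style_pct": relative_change(base_style, full_style),
        "extra_seconds": full_time - base_time,
    }

for method, s in summary.items():
    print(
        f"{method}: CLIP-Text {s['clip_text_pct']:+.1f}%, "
        f"CLIP-Style {s['clip_style_pct']:+.1f}%, "
        f"+{s['extra_seconds']:.2f} s"
    )
```

For every baseline in the table, CLIP-Text rises and CLIP-Style falls when EAAF and PFR are both enabled, and the added inference time stays below one second.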
Table 2.
User preference rates of different methods in the user study. (↑ indicates that a higher value is better. Bold values indicate the best performance in each evaluation criterion.)
| Method | Text Similarity (%) ↑ | Style Similarity (%) ↑ | Overall Quality (%) ↑ |
|---|---|---|---|
| StyleCrafter [20] | 1.15 | 6.97 | 3.10 |
| InstantStyle [15] | 20.11 | 29.73 | 23.41 |
| CSGO [16] | 9.32 | 13.61 | 10.95 |
| StyleStudio [17] | 19.13 | 15.04 | 16.05 |
| Ours | 50.29 | 34.65 | 46.49 |
Table 3.
Quantitative comparison of different PFR parameter settings. (↑ indicates that a higher value is better. The bold row indicates the final parameter configuration adopted in the proposed method.)
| Version | PFR Parameters | CLIP-Text ↑ | CLIP-Style ↑ |
|---|---|---|---|
| v1 | (0, 10, 0.8) | 0.2245 | 0.6151 |
| v2 | (0, 5, 0.8) | 0.2314 | 0.6060 |
| v3 | (0, 20, 0.8) | 0.2110 | 0.6267 |
| v4 | (0, 20, 0.5) | 0.2164 | 0.6250 |
| v5 | (5, 10, 0.8) | 0.2343 | 0.6092 |
| v6 | (5, 15, 0.8) | 0.2265 | 0.6188 |
| v7 | (5, 20, 0.8) | 0.2207 | 0.6241 |
| v8 | (5, 10, 0.5) | 0.2353 | 0.6055 |
| v9 | (5, 15, 0.5) | 0.2291 | 0.6136 |
| v10 | (5, 10, 1.0) | 0.2340 | 0.6088 |
| v11 | (5, 10, 1.1) | 0.2336 | 0.6088 |
| v12 | (5, 10, 1.2) | 0.2345 | 0.6084 |
| v13 | (5, 10, 1.3) | 0.2349 | 0.6083 |
| v14 | (5, 10, 1.4) | 0.2357 | 0.6085 |
| v15 | (5, 10, 1.5) | 0.2363 | 0.6066 |
| v16 | (5, 10, 1.6) | 0.2369 | 0.6066 |
| v17 | (5, 10, 1.7) | 0.2378 | 0.6058 |
| v18 | (5, 10, 1.8) | 0.2388 | 0.6050 |
| v19 | (5, 10, 1.9) | 0.2395 | 0.6043 |
| v20 | (5, 10, 2.0) | 0.2401 | 0.6038 |
Table 4.
Quantitative comparison of different QKV configurations. (↑ indicates that a higher value is better. ✓ and ✗ denote that EAAF & PFR are applied and not applied, respectively.)
| Method | EAAF & PFR | QKV Configurations | CLIP-Text ↑ | CLIP-Style ↑ |
|---|---|---|---|---|
| CSGO [16] | ✗ | - | 0.2137 | 0.6356 |
| | ✓ | Content as Q, Style as K/V | 0.2530 | 0.5417 |
| | ✓ | Style as Q, Content as K/V | 0.2369 | 0.6066 |
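
To make the two QKV configurations in Table 4 concrete, the sketch below runs plain single-head scaled dot-product attention on toy features with the roles of content and style swapped. It is a generic illustration under simplifying assumptions: there are no learned projections, and the feature values and the names `cross_attention`, `content`, and `style` are hypothetical rather than the paper's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    """Single-head scaled dot-product attention on nested lists."""
    dim = len(query[0])
    output = []
    for q in query:
        # Similarity of this query token to every key token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in key]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


# Toy features (hypothetical): 2 content tokens and 3 style tokens, 2 channels each.
content = [[1.0, 0.0], [0.0, 1.0]]
style = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]

content_as_q = cross_attention(content, style, style)  # Content as Q, Style as K/V
style_as_q = cross_attention(style, content, content)  # Style as Q, Content as K/V

print(len(content_as_q), len(style_as_q))  # 2 3
```

The output always has one row per query token, and each row is a convex combination of the value rows, so the choice of which feature acts as Q determines both the output length and which feature's values are mixed into the result.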