Article

StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding

Division of Computer Science and Artificial Intelligence, Dongguk University, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(19), 10623; https://doi.org/10.3390/app151910623
Submission received: 8 September 2025 / Revised: 22 September 2025 / Accepted: 29 September 2025 / Published: 30 September 2025
(This article belongs to the Special Issue Intelligent Computing for Sustainable Smart Cities)

Abstract

Recent advancements in text-to-image models, such as Stable Diffusion, have showcased their ability to create visual images from natural language prompts. However, existing methods like DreamBooth struggle with capturing arbitrary art styles due to the abstract and multifaceted nature of stylistic attributes. We introduce Single-StyleForge, a novel approach for personalized text-to-image synthesis across diverse artistic styles. Using approximately 15 to 20 images of the target style, Single-StyleForge establishes a foundational binding of a unique token identifier with a broad range of attributes of the target style. Additionally, auxiliary images are incorporated for dual binding that guides the consistent representation of crucial elements such as people within the target style. Furthermore, we present Multi-StyleForge, which enhances image quality and text alignment by binding multiple tokens to partial style attributes. Experimental evaluations across six distinct artistic styles demonstrate significant improvements in image quality and perceptual fidelity, as measured by FID, KID, and CLIP scores.

1. Introduction

Recent breakthroughs in text-to-image diffusion models—including Stable Diffusion [1,2], DALL-E [3,4], Imagen [5], SEDD [6], and CoMat [7]—have enabled users to generate high-quality, diverse images simply by describing them in natural language prompts. However, these models typically generate outputs in a limited set of visual styles learned from large-scale training datasets, and adapting them to produce images in specific or user-provided artistic styles remains a significant challenge.
In creative applications such as digital illustration, personalized avatars, and visual storytelling [8,9], users often wish to generate content in a consistent, stylized aesthetic (e.g., anime, romanticism, or pixel-art), using only a handful of style reference images. However, existing diffusion models do not provide a reliable way to make this adaptation, especially for users without access to large datasets or advanced fine-tuning skills.
Beyond these creative domains, controllable text-to-image generation is increasingly relevant to smart city services, where visual communication and immersive content play a critical role [10]. Applications [11,12] such as urban simulation dashboards, digital signage for public health and safety, cultural-heritage promotion, and citizen engagement platforms require stylized yet coherent imagery tailored to local identities. Manual design of such visuals is costly and inflexible, whereas few-shot style-adaptive generation can provide scalable solutions [13] for producing culturally adaptive and context-aware media in smart city environments. Therefore, a personalization framework that reliably learns a city- or service-specific visual idiom from a small number of examples is of practical importance.
Prior methods such as DreamBooth [14], Textual Inversion [15], SOD-T2I [16], and LoRA [17] enable subject-specific fine-tuning of pre-trained text-to-image models with a small number of reference images. Despite significant progress in synthesizing images that mimic the styles of renowned painters and artistic movements, such as Picasso or Impressionism, these methods are designed for object- or identity-level personalization and often fail to generalize when applied to abstract artistic styles. The concept of “artistic styles” involves complex visual elements across subjects, backgrounds, and textures, e.g., “an Asian girl on a London street in the style of Van Gogh”. Moreover, these methods are prone to overfitting, language drift [14], and limited compositional control, especially under low-data settings.
To address these limitations, we propose StyleForge, a user-adaptive personalization framework that transforms a general-purpose text-to-image diffusion model into a style-specialized generator using only a few reference images. StyleForge introduces a dual-binding mechanism, in which the model learns to associate the following: (i) a style token (e.g., “[V] style”) with the overall visual characteristics of the specific artistic style, which we call the target style, and (ii) an auxiliary token (e.g., “style”) with curated human-centric images from semantically related styles to guide the rendering of people within the style domain. This auxiliary-guided token binding helps preserve style consistency across both people and backgrounds, while preventing overfitting to the limited reference set. Furthermore, we extend our framework to Multi-StyleForge, which separates different style components (e.g., person and background) into multiple tokens (e.g., “[V] style” and “[W] style”), enabling more fine-grained control and compositional alignment between prompt and image.
StyleForge can be applied to any pre-trained diffusion model (e.g., Stable Diffusion v1.5) without modifying the architecture, making it suitable for real-world scenarios where users want to adapt generation models to their desired style quickly and with minimal effort. We evaluate StyleForge across six domains and show that it significantly outperforms existing personalization baselines in both quantitative metrics (FID, KID, CLIP scores) and qualitative fidelity. To summarize, our contributions are as follows:
  • Dual-binding personalization framework: we introduce StyleForge, a novel approach that binds a target style to both a unique style token and an auxiliary token, enabling controllable and robust style adaptation from limited data.
  • Auxiliary-guided human rendering: we leverage carefully curated auxiliary images containing human-centric style elements to guide the rendering of people, enhancing generalization and mitigating language drift.
  • Multi-token decomposition for fine-grained control: we extend our framework, called Single-StyleForge, to Multi-StyleForge, which disentangles person and background characteristics into separate tokens to improve the compositional alignment between the prompt and image.
  • Plug-and-play adaptation for any artistic style: StyleForge requires only 15–20 reference images, imposes no architectural changes, and can be easily integrated with existing text-to-image diffusion models for user-friendly personalization.
  • Extensive evaluation: We conduct experiments across six distinct art styles and demonstrate superior style fidelity, text–image alignment, and robustness compared to state-of-the-art methods. Our main results are summarized in Figure 1.

2. Related Work

2.1. Text-to-Image Synthesis

Text-to-image generation has rapidly evolved with the development of powerful generative models that translate natural language prompts into high-resolution images by learning joint distributions over text and image modalities. Imagen [5] uses pixel-space diffusion with a pyramid structure, while Stable Diffusion [1] performs diffusion in latent space for improved efficiency. DALL-E [3,4] adopts transformer-based autoregressive and two-stage priors for text-conditioned image generation. Other notable models include Muse [18], which employs a masked generative transformer, and Parti [19], which combines ViT-VQGAN tokenization with autoregressive decoding. ConsiStory [20] enhances consistency across image batches using shared attention and feature correspondence mechanisms.

2.2. Style Transfer

Neural style transfer modifies the visual appearance of a content image by applying the stylistic characteristics of a reference image while preserving the content. Early methods leveraged CNN features—such as VGG-based correlations [21]—or GANs like StyleGAN [22], SDP-GAN [23], and PGGAN [24] to generate high-resolution stylized outputs. Domain-specific variants, such as anime-oriented architectures [25], further tailored these methods. Later approaches enhanced content–style disentanglement and control through mechanisms like AdaIN [26], AdaAttN [27], and RAST [28], which introduced attention and multi-loss objectives. StyleCLIP [29] enabled intuitive language-based style control. Recent methods based on diffusion models [30,31,32,33,34] offer improved flexibility and fidelity. StyleDiffusion [35] and PatchMatch [36] apply statistical and patch-based transformations, while inversion-based methods [37] and DreamStyler [38] leverage text prompts and language-image models (e.g., BLIP-2 [39]) for text-to-style binding during generation.

2.3. Personalizing and Controlling Diffusion Models

Recent efforts in text-to-image synthesis have focused on personalizing diffusion models [16,40,41] to reflect user-specific content or style. Textual Inversion [15] and NeTI [42] learn new text token embeddings that capture the semantics of a subject or style from limited examples. DreamBooth [14] fine-tunes pre-trained models on a few subject images using a prior preservation loss to maintain semantic consistency, while StyleBoost [43] reinforces style representation by anchoring to a meta-class image. DreamArtist [44] and SpecialistDiffusion [45] explore embedding learning and data augmentation to enhance stylistic fidelity. Beyond diffusion-based models, StyleDrop [46] adapts Muse [18], a generative vision transformer, to render diverse styles via supervised token tuning.
Encoder-based approaches, such as HyperDreamBooth [47], E4T [48], and domain-agnostic tuning encoders [49], inject external priors into diffusion models from a single reference image for rapid personalization. However, these methods struggle with disentangling style from content, often reproducing subjects too closely tied to the reference image, i.e., overfitting. Several works aim to extend personalization to multiple subjects or compositional layouts. CustomDiffusion [50], SVDiff [51], and HyperDreamBooth [47] offer parameter-efficient multi-subject adaptation. SubjectDiffusion [52] and Perfusion [53] control attention via attention maps and rank-one updates, while Break-A-Scene [54] and ControlNet [55] decompose scenes into multiple concepts and inject layout guidance through segmentation-aware control mechanisms. Finally, DAAM [56] and I2AM [57] contribute to understanding and controlling the operation of diffusion models by interpreting their generation processes, with further comparisons available in the Supplementary Materials.

2.4. Toward Stylized Personalization

A few recent works attempt to bridge the gap between style and subject adaptation. StyleDrop [46] uses vision transformers and learns a style-specific token to transfer appearance, but it is limited to known styles and requires architecture-specific modifications. SpecialistDiffusion [45] and DreamArtist [44] improve prompt tuning and sample efficiency but lack generalization to unseen artistic domains. StyleBoost [43] attempts to personalize DreamBooth to stylized outputs by reinforcing latent similarity, but does not disentangle background/person contributions or support compositional control. In contrast, StyleForge introduces a dual-token binding mechanism that explicitly separates the person-oriented and background-oriented components of a target style.

3. Preliminaries

3.1. Diffusion Models

Diffusion models [58,59,60] are probabilistic generative models that learn data distributions by reversing a noise addition process. Starting from random noise, they progressively denoise to generate data samples by learning a series of denoising autoencoders { ϵ θ ( x t , t ) } t = 1 T that predict cleaner versions of noisy inputs x t at each timestep t. The training objective minimizes
\mathcal{L}_{\mathrm{DM}} = \mathbb{E}_{x, \epsilon, t}\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2 \right],
where ϵ ∼ N(0, 1) is the Gaussian noise added to the data.
Latent Diffusion Models (LDMs) extend this by operating in a compressed latent space. An encoder E maps images x to latent representations z = E ( x ) , where denoising occurs on z t = α t z + σ t ϵ with time-dependent scaling factors α t , σ t . For text-conditional generation, LDMs minimize
\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z, c, \epsilon, t}\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right],
where c = Γ_ϕ(p) encodes the text prompt p via a text encoder Γ_ϕ.
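The objective above maps directly to a short training routine. The following is a minimal PyTorch sketch of the text-conditioned noise-prediction loss in (2), assuming Stable-Diffusion-style components (a VAE encoder, CLIP text encoder, U-Net, and a DDPM noise scheduler from the diffusers library); the function and variable names are ours and preprocessing details are simplified.

```python
import torch
import torch.nn.functional as F

def ldm_loss(unet, vae, text_encoder, tokenizer, scheduler, images, prompts):
    """Text-conditioned LDM objective: predict the noise added to latent codes."""
    # Encode images into the latent space (z = E(x)), scaled as in Stable Diffusion.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    # Encode the prompts into conditioning vectors c = Gamma_phi(p).
    tokens = tokenizer(prompts, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    cond = text_encoder(tokens.input_ids.to(latents.device))[0]
    # Sample a timestep and Gaussian noise, then form z_t = alpha_t * z + sigma_t * eps.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # The U-Net predicts the added noise; the loss is the squared error in (2).
    pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred, noise)
```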

3.2. DreamBooth

DreamBooth [14] personalizes text-to-image diffusion models using 4–6 images of a specific subject. It binds a unique identifier token “[V]” to the subject (e.g., “a [V] dog” for a specific dog) while preserving the meta-class semantics (“dog”). The key innovation is the prior preservation loss that prevents overfitting. During training, meta-class images x pr are generated from the frozen model using the prompt “dog”; then, both the instance and prior images are used to fine-tune the model:
\mathcal{L}_{\mathrm{DB}} = \mathbb{E}_{z, c, \epsilon, \epsilon', t}\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 + \lambda \left\lVert \epsilon' - \epsilon_\theta(z_t^{\mathrm{pr}}, t, c_{\mathrm{pr}}) \right\rVert_2^2 \right],
where λ balances instance-specific learning with meta-class preservation, and z_t and z_t^pr are the noisy latents of the instance and prior images, respectively.
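For concreteness, the meta-class prior images x^pr can be produced with the frozen pipeline before fine-tuning begins. The snippet below is a hedged sketch using the diffusers library; the model identifier, meta-class prompt, number of prior images, and output directory are illustrative assumptions rather than the exact settings of [14].

```python
import os
import torch
from diffusers import StableDiffusionPipeline

# Generate meta-class prior images x_pr with the frozen pre-trained model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("prior_images", exist_ok=True)
for i in range(200):  # DreamBooth typically uses a few hundred prior images
    image = pipe("a photo of a dog", num_inference_steps=30).images[0]
    image.save(f"prior_images/dog_{i:03d}.png")
```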

4. Method: StyleForge

We tackle the challenge of generating high-quality images in various artistic styles (or simply, styles) using text prompts as guidance. Unlike DreamBooth [14], which relies on synthetic priors with irrelevant semantics, we use curated auxiliary images. Unlike StyleBoost [43], which reinforces latent similarity without disentangling style and content, we perform explicit semantic decomposition. Unlike SpecialistDiffusion [45] and DreamArtist [44], which rely on augmentation or prompt-only tuning, we separate style components via multi-token binding. Parameter-efficient methods such as LoRA [17], Textual Inversion [15], and Custom Diffusion [50] optimize a limited set of parameters and often fail on out-of-distribution styles. These limitations are amplified for abstract styles that differ from those of renowned painters.

4.1. Single-StyleForge: Overall Architecture

To address the above issues, we first propose Single-StyleForge, which reliably generates style variations guided by prompts through comprehensive fine-tuning and auxiliary binding strategies. The architecture of Single-StyleForge is based on the framework of DreamBooth [14], focusing on synthesizing images of a specific style, which we call a target style, rather than a particular object. As illustrated in Figure 2, Single-StyleForge fine-tunes pre-trained text-to-image diffusion models x ^ θ using a few reference images x of the target style, called StyleRef images, and a set of auxiliary images x aux , called Aux images. The loss function for Single-StyleForge is defined as follows:
\mathcal{L}_{\mathrm{SSF}} = \mathbb{E}_{z, c, \epsilon, \epsilon', t}\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 + \lambda \left\lVert \epsilon' - \epsilon_\theta(z_t^{\mathrm{aux}}, t, c_{\mathrm{aux}}) \right\rVert_2^2 \right],
where z_t^aux := α_t E(x^aux) + σ_t ϵ′ and c_aux are used instead of the meta-class prior components in (3). In particular, the second term in (4) acts as an auxiliary term that guides the model with information similar to the human perception of the target style. Finally, λ controls the strength of the second term.
For a given target style, we use 15–20 StyleRef images x that showcase the key characteristics of the style. These images include a mix of landscapes, objects, and people to provide a comprehensive representation of the target style. The corresponding StyleRef prompts p are crafted to encapsulate the essence of the style using a unique token identifier (e.g., “[V] style”). Unlike the meta-class prior images x^pr in DreamBooth, the auxiliary images x^aux are collected to supplement the StyleRef images. They are chosen to enhance the model’s understanding of additional context in the target style, capturing generic features for challenging elements like human faces and poses. The auxiliary prompts p^aux use a general token (e.g., “style”) to ensure they provide broad guidance without introducing bias. Finally, the training dataset D comprises pairs of StyleRef images with their prompts (x, p) and Aux images with their prompts (x^aux, p^aux).
The training process of Single-StyleForge involves the following steps. First, load the pre-trained weights of the encoder E, text encoder Γ_ϕ, and U-Net ϵ_θ (line 1), and sample a pair of StyleRef and Aux images and prompts from the dataset D (line 3); then, encode the StyleRef and Aux prompts p, p^aux into conditioning vectors c, c^aux using the text encoder (line 4). Next, the forward diffusion process is applied to produce the noisy latent codes z_t, z_t^aux from sampled noises ϵ, ϵ′ (lines 5–7). Finally, the model is optimized by minimizing the loss (4), taking a gradient descent step to update the model parameters ω (line 8). The detailed training process is outlined in Algorithm 1, with a code sketch following the algorithm.
Algorithm 1 Single-StyleForge.
  • Require: dataset D = {(x, p), (x^aux, p^aux)}, encoder E, text encoder Γ_ϕ, U-Net ϵ_θ, hyper-parameters {σ_t, α_t}_{t=1,…,T}, and control parameter λ
  • Ensure: trained model Γ_ϕ, ϵ_θ with learnable weights ω = {θ, ϕ}
1: Initialize: load pre-trained weights for E, Γ_ϕ, ϵ_θ
2: repeat
3:     sample a data pair (x, p), (x^aux, p^aux) ∼ D
4:     obtain conditioning vectors c = Γ_ϕ(p), c^aux = Γ_ϕ(p^aux)
5:     sample time t ∼ Uniform({1, …, T})
6:     sample noise vectors ϵ, ϵ′ ∼ N(0, I)
7:     z_t := α_t E(x) + σ_t ϵ,  z_t^aux := α_t E(x^aux) + σ_t ϵ′    ▹ forward diffusion processes
8:     ω ← ω − Optimizer(∇_ω [ ‖ϵ − ϵ_θ(z_t, t, c)‖²₂ + λ ‖ϵ′ − ϵ_θ(z_t^aux, t, c^aux)‖²₂ ])    ▹ gradient descent step
9: until converged
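A minimal PyTorch sketch of one optimization step of Algorithm 1 is given below. It assumes the StyleRef and Aux batches have already been encoded into latents (z, z_aux) and conditioning vectors (c, c_aux), and that the scheduler's add_noise implements z_t = α_t z + σ_t ε; all names are ours, and details such as gradient accumulation are omitted.

```python
import torch
import torch.nn.functional as F

def single_styleforge_step(unet, optimizer, scheduler, z, c, z_aux, c_aux, lam=1.0):
    """One gradient step of Algorithm 1 on a (StyleRef, Aux) pair of latent batches."""
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z.shape[0],), device=z.device)
    eps, eps_aux = torch.randn_like(z), torch.randn_like(z_aux)
    # Forward diffusion: z_t = alpha_t * z + sigma_t * eps (handled by scheduler.add_noise).
    z_t = scheduler.add_noise(z, eps, t)
    z_t_aux = scheduler.add_noise(z_aux, eps_aux, t)
    # Dual-binding loss: StyleRef term ("[V] style") + lambda * Aux term ("style").
    loss = F.mse_loss(unet(z_t, t, encoder_hidden_states=c).sample, eps) \
         + lam * F.mse_loss(unet(z_t_aux, t, encoder_hidden_states=c_aux).sample, eps_aux)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```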

4.2. Rationale Behind Auxiliary Images

The use of auxiliary images x^aux is crucial in the training process of Single-StyleForge, and their main roles are discussed here.

4.2.1. Aiding in the Binding of the Target Style

While binding a unique token to an object is relatively straightforward, as in DreamBooth [14], capturing the diverse features of a target style and combining them with the identifier pose a challenge. Due to significant variations in artistic styles, learning features such as vibrant colors, exaggerated facial expressions, and dynamic movements in styles like “anime” becomes difficult with only a few StyleRef images x. Moreover, we observe that a pre-trained text-to-image model (e.g., Stable Diffusion v1.5) embeds a wide range of characteristics associated with the word “style”, mostly including fashion styles, fabric patterns, and often art styles, as shown in Figure A1. Instead of retaining unnecessary meanings of the word “style”, we propose utilizing Aux images x^aux to allow the token “style” to encapsulate concepts essential for expressing artwork features. As a result, while the StyleRef images x and prompt p (e.g., “[V] style”) capture overall information about the target style, the Aux images x^aux and prompt (e.g., “style”) provide more detailed information about specific aspects, such as how to represent a person in that style. This adjustment redirects the embedding of the word “style” from unrelated meanings, such as fashion styles, to general artwork styles, alleviating overfitting and thereby enhancing overall learning performance.

4.2.2. Improving Text-to-Image Performance

In [14], it is demonstrated that using a set of meta-class prior images, generated by the pre-trained diffusion model from a meta-class prompt, can enhance the personalization capability. However, in our case, we found that a set of images generated by the pre-trained diffusion model from the prompt “style” contained unnecessary context such as fashion styles, making it difficult to incorporate essential information during the training process. Additionally, we noticed that providing detailed descriptions of a person (e.g., hands, legs, facial features, and full-body poses) remains crucial in qualitative evaluations, while drawing landscapes or animals has less impact. Therefore, in Single-StyleForge, the Aux images x^aux primarily consist of portraits and/or people collected from high-resolution images on the Internet, with the aim of enhancing the overall text-to-image synthesis performance, particularly for generating high-quality images related to people.

4.2.3. Mitigating Language Drift

As pointed out in [14], personalization of text-to-image models commonly leads to several issues: (i) overfitting to a small set of input images (i.e., StyleRef images), resulting in images that are specific to a particular context and subject appearance, such as images with the same background, as well as a lack of alignment between text and image, and (ii) language drift, which causes the model to associate the prompt with a limited set of input images and lose diverse meanings of the meta-class name. However, when personalizing models to fit a style rather than a specific subject, it is observed that language drift becomes less of a concern, as “style” is an abstract concept that does not require strict adherence to the meaning and diversity of the word. We expect that if the desired style concept is encoded in the “style” token, then language drift is not a significant issue.

4.3. Multi-StyleForge

While Single-StyleForge focuses on learning a comprehensive representation of a target style, Multi-StyleForge enhances this capability by separating the stylistic attributes into multiple specific components. This approach aims to improve the alignment between the text prompts and the generated images, particularly for styles that involve complex compositions of backgrounds and persons.
Multi-StyleForge builds on the foundation of Single-StyleForge by dividing the components of the target style and mapping each to a unique identifier for training, adopting the method in [50]. Because Single-StyleForge maps the StyleRef images to a single prompt (e.g., “[V] style”), inference with a prompt that contains no person-related description often still produces an image that includes a person. To address this issue, Multi-StyleForge uses two StyleRef prompts (e.g., “[V] style” and “[W] style”), one for persons and another for backgrounds, to train the model more effectively. By separating these elements explicitly, we address the ambiguity that can arise when a single prompt is used to capture both. We also note that the approach can be extended to more than two components.

4.3.1. Multi-StyleRef Prompts Configuration

The StyleRef images consist of two parts: elements of people and backgrounds in the target style, following the structure from Single-StyleForge. Each component is then associated with its specific prompt (e.g., “[V] style” for persons and “[W] style” for backgrounds). The Aux images and prompt x aux , p aux are kept unified as “style” to ensure they provide general guidance applicable to both components. As a result, Multi-StyleForge trains the model to differentiate stylistic features (people and backgrounds) and obtain separate embeddings.

4.3.2. Training of Multi-StyleForge

Our Multi-StyleForge is built on the Custom Diffusion [50] framework for personalizing multiple subjects, but differs in that the model is fully fine-tuned. The training process of Multi-StyleForge involves simultaneous (parallel) learning of the two StyleRef prompts. When each StyleRef prompt is personalized sequentially, there is a risk of losing information about previously learned StyleRef prompts in the subsequent process. Therefore, Multi-StyleForge adopts simultaneous learning of multiple StyleRef prompts. The Multi-StyleForge operation typically involves two text–image pairs D_1 and D_2, as illustrated in Algorithm 2 (a sketch of the selection step follows the algorithm). In particular, each component is selected with a probability proportional to the number of data samples (line 3); then, the model is trained using the selected StyleRef data, similarly to Single-StyleForge, to capture the associated stylistic components (lines 4–8).
Algorithm 2 Multi-StyleForge.
  • Require: data D_1 = {(x, p), (x^aux, p^aux)}, D_2 = {(x, p), (x^aux, p^aux)}, encoder E, text encoder Γ_ϕ, U-Net ϵ_θ, hyper-parameters {σ_t, α_t}_{t=1,…,T}, and control parameter λ
  • Ensure: trained model Γ_ϕ, ϵ_θ with learnable weights ω = {θ, ϕ}
1: Initialize: q = |D_1| / (|D_1| + |D_2|), load pre-trained weights for E, Γ_ϕ, ϵ_θ
2: repeat
3:     select a dataset D = D_1 if Q ∼ Uniform([0, 1]) < q, else D = D_2
4:     sample a data pair (x, p), (x^aux, p^aux) ∼ D
5:     c = Γ_ϕ(p), c^aux = Γ_ϕ(p^aux); sample time t ∼ Uniform({1, …, T})
6:     sample noise vectors ϵ, ϵ′ ∼ N(0, I)
7:     z_t := α_t E(x) + σ_t ϵ,  z_t^aux := α_t E(x^aux) + σ_t ϵ′    ▹ forward diffusion processes
8:     ω ← ω − Optimizer(∇_ω [ ‖ϵ − ϵ_θ(z_t, t, c)‖²₂ + λ ‖ϵ′ − ϵ_θ(z_t^aux, t, c^aux)‖²₂ ])    ▹ gradient descent step
9: until converged
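The dataset-selection step (line 3 of Algorithm 2) amounts to a Bernoulli draw weighted by the component sizes. A minimal sketch with illustrative names is shown below; the selected pair is then processed exactly as in the Single-StyleForge step sketched earlier.

```python
import random

def select_component(d1, d2):
    """Pick the person or background component with probability proportional to its size.

    d1, d2: lists of ((x, p), (x_aux, p_aux)) pairs for the two style components.
    """
    q = len(d1) / (len(d1) + len(d2))
    chosen = d1 if random.random() < q else d2
    return random.choice(chosen)  # one ((x, p), (x_aux, p_aux)) pair from the chosen set
```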
Multi-StyleForge improves the alignment between text and images by reducing ambiguity through the use of multiple specific tokens. During inference, these tokens guide the generation process to ensure that images align more accurately with the corresponding components and prompts. For instance, using the token “[V]” for persons and “[W]” for backgrounds helps the model generate images with clear distinctions between these two elements. Experiments conducted with Multi-StyleForge show high-fidelity performance and better text–image alignment, which will be discussed in Section 5.

5. Experimental Results

In this section, we assess the performance of Single/Multi-StyleForge in personalizing text-to-image generation across different artistic styles. Through experiments, we investigate how well our methods generate high-quality images that faithfully reflect the target styles.

5.1. Experimental Setup

We conducted experiments on six common artistic styles: realism, midjourney, anime, romanticism, cubism, and pixel art. The characteristics of each style are summarized as follows:
  • Realism focuses on an accurate and detailed representation of subjects.
  • Midjourney is characterized by detailed rendering and dramatic imaginative expressions, reflecting the distinctive style of the MidJourney model [61].
  • Anime refers to a Japanese animation style characterized by vibrant colors, exaggerated facial expressions, and dynamic movement.
  • Romanticism prioritizes emotional expression, imagination, and the sublime, often portraying fantastical and emotional subjects with a focus on rich dark tones and extensive canvases.
  • Cubism emphasizes representing visual experiences by depicting objects from multiple angles simultaneously, often in polygonal or fragmented forms.
  • Pixel art involves creating images by breaking them down into small square pixels, adjusting their size and arrangement to form the overall image.
Example images for each style are provided in the top row of Figure 1, demonstrating various visual attributes of the target styles. The StyleRef images for training and evaluation are collected by using pre-trained diffusion models or from the Web. For the target styles of realism, midjourney and anime, we used pre-trained diffusion models from Hugging Face [62] to generate 18,764 images. For the target styles of romanticism and cubism, we collected 3600 images from WikiArt [63], and we obtained 1000 images from Kaggle [64] for the target style of pixel art. In addition, to enhance the ability to generate images of people, we carefully gathered auxiliary images from different auxiliary styles found on the Web. The details of the Aux images are discussed in Section 5.4.
We assessed the quality of the generated images using various metrics, including FID [65], KID [66], and CLIP [67] scores. FID and KID measure the similarity between real and generated images, where lower scores indicate better image quality. In contrast, the CLIP score evaluates the correspondence between images and text, with higher scores reflecting better alignment. To generate diverse images from the trained model, we used 1562 prompts from the Parti Prompts dataset [19], covering 12 categories such as people, animals, and artwork. Examples of these prompts include “the Eiffel Tower”, “a cat drinking a pint of beer”, and “a scientist”. For each prompt, we produced 12 images, yielding a total of 18,744 images for evaluation. Additionally, StyleForge, trained on a target style, generated 6 images per prompt, resulting in a total of 9372 images per style.
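As a reference for reproducing this protocol, the snippet below sketches how the three metrics can be computed with torchmetrics; the specific settings (Inception feature size, KID subset size, CLIP backbone) are assumptions on our part and may differ from the exact evaluation pipeline used here.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate(real_images, fake_images, prompts):
    """real_images/fake_images: uint8 tensors of shape (N, 3, H, W); prompts: list of str."""
    fid = FrechetInceptionDistance(feature=2048)
    kid = KernelInceptionDistance(subset_size=100)
    clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    kid.update(real_images, real=True)
    kid.update(fake_images, real=False)
    clip.update(fake_images, prompts)

    kid_mean, _ = kid.compute()  # KID returns (mean, std) over subsets
    return {"FID": fid.compute().item(),
            "KID": kid_mean.item(),
            "CLIP": clip.compute().item()}
```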

5.2. Implementation Details

5.2.1. Ours

As the base diffusion model, we used Stable Diffusion (SD v1.5), which had been pre-trained on realistic images. We employed the Adam optimizer with a learning rate of 1 × 10⁻⁶, set the number of inference steps to 30, and used λ = 1 in the experiments. Fine-tuning the pre-trained text-to-image model for the six target styles involved minimizing the loss (4) using 20 StyleRef images and 20 Aux images. All subsequent experiments adhered to the training iterations that achieved the best FID/KID scores (see Figure A2 in Appendix A). In Single-StyleForge, we set the StyleRef prompt p to “a photo of [V] style” and the Aux prompt p^aux to “a photo of style”. In Multi-StyleForge, StyleRef is divided into two components (i.e., person and background), paired with “a photo of [V] style” and “a photo of [W] style”, respectively.
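A hedged sketch of this setup with the diffusers library is shown below; the Hugging Face model identifier and the choice to train both the U-Net and the text encoder while freezing the VAE reflect our reading of the description above rather than released training code.

```python
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Load the pre-trained base model (SD v1.5) and expose its trainable components.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, text_encoder, vae, tokenizer = pipe.unet, pipe.text_encoder, pipe.vae, pipe.tokenizer
vae.requires_grad_(False)  # image encoder stays frozen; U-Net and text encoder are tuned

# Adam with the learning rate reported above; the noise scheduler supplies alpha_t, sigma_t.
optimizer = torch.optim.Adam(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-6
)
noise_scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

# Prompts used for dual binding in Single-StyleForge.
styleref_prompt = "a photo of [V] style"
aux_prompt = "a photo of style"
```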

5.2.2. Baseline Models

The baseline models, Textual Inversion [15], LoRA [17], DreamBooth [14], and Custom Diffusion [50], were trained to achieve their best FID/KID scores using the same text–image pairs for a fair comparison, and all CFG scales were set to 7.5. Some baseline methods, including Textual Inversion [15] and LoRA [17], do not utilize auxiliary prompts; thus, auxiliary images were not used for them. In the case of DreamBooth [14], auxiliary images were generated by a pre-trained diffusion model using the auxiliary prompt. Custom Diffusion [50] was used as a baseline for Multi-StyleForge, using the same text–image pairs, with auxiliary images generated as in DreamBooth. See Table 1 for a summary.

5.3. Analysis of StyleRef Images

We assessed the impact of the StyleRef images on the personalization performance. Encapsulating the target style with numerous StyleRef images may ease customization but reduces accessibility, whereas with only 3–5 images it is challenging to capture the target style. It is therefore important to maintain diversity while personalizing with a limited image set to effectively portray the desired style. Our empirical findings suggest that around 20 StyleRef images are effective for style personalization. To investigate the composition of StyleRef images, we fine-tuned the base diffusion model using 20 StyleRef images in the different combinations listed below, without the use of Aux images. In addition, we provide the dataset configurations for both StyleRef and Aux images online (https://drive.google.com/drive/folders/1MZDv_NyBJm0x6RLWd2ILn0MybiU-D92S, accessed on 28 September 2025).
  • Only backgrounds: 20 landscape images in the target style.
  • Only persons: 20 portraits and/or people images in the target style.
  • Mixed backgrounds and persons: a mix of 10 landscape and 10 people images.
As depicted in Figure 3, using only landscape images for StyleRef led the model to misunderstand the target style due to biased information. Similarly, using only people images resulted in overfitting, as the appearance of a person became a crucial element, as shown in Figure 3b. In contrast, StyleRef images consisting of a mix of landscapes and people effectively captured the general features of the target style, synthesizing images well aligned with the prompts (see Figure 3c). Finally, Table 2 shows a performance improvement ranging from 50.07% to 59.22% in the quantitative evaluation when using the mixed StyleRef composition.

5.4. Analysis of the Aux Images

5.4.1. Configuration of Aux Images

As discussed in Section 4.2, it is important to configure the Aux images properly to improve the performance of Single-StyleForge. When selecting the auxiliary style, we aim to choose a style that is not only similar to the target style but also provides general attributes for creating better images of people. With this purpose in mind, we observed that styles such as realism, midjourney, anime, and romanticism mainly depict human attributes realistically. In contrast, cubism represents people using polygons and impressionistic shapes, and pixel art portrays realistic human figures using pixels. We therefore tailored the auxiliary style and images to boost the personalization of each target style. For the realism, midjourney, anime, and romanticism target styles, we collected digital painting images covering the range from realism to abstraction, while impressionism images were selected for the cubism style and realism images were chosen for the pixel art style.

5.4.2. Auxiliary Binding

We designed the Aux images to play a supporting role in the personalization of the target style. Using the Aux images, we transferred the “style” token from the original fashion style to the artistic style area, facilitating the training process. Our main objective is for the “[V]” token to thoroughly learn the target style and for the “style” token to represent people in the auxiliary style as a booster. In Figure 4, we visualize the attention maps of the tokens “[V]” and “style” to evaluate the auxiliary binding. The attention of “[V]” is evenly distributed across the images, as the StyleRef images contain both the person and the background, whereas the “style” token focuses specifically on the person, as the Aux images only contain the person. The images generated by each token are displayed in the top and bottom rows of Figure 5. It is observed that the “style” token effectively captures the attributes of the person itself, while “[V] style” comprehensively produces images of the target style. Finally, the ablation results of using the Aux images are provided in Figure 6, showing that adding the Aux images enhances the FID/KID scores across all target styles, indicating a reduction in overfitting, together with a slight increase in the CLIP score.

5.4.3. Comparison with DreamBooth

Table 3 presents the numerical results based on the composition of the Aux images, with examples provided in Appendix A (Figure A1). Encoding useful information into the Aux images enhances performance compared to generating unrefined Aux images with a pre-trained diffusion model, as proposed in DreamBooth [14]. In particular, the performance increases over the style token were 9.02%, 14.49%, 34.27%, 19.86%, 63.00%, and 16.49%. However, composing Aux images with the same style as the target, or including dissimilar information (e.g., human-drawn art), can hinder the model’s generalization abilities and lead to overfitting. In summary, while Aux images are not directly linked to the target style, they should complement the style learning process by providing a more comprehensive understanding of visual features and serving as auxiliary bindings for the target style.

5.5. Multi-StyleForge: Improved Text–Image Alignment Method

Multi-StyleForge separates styles using multiple specific prompts to better distinguish between images generated from different text conditions. This approach clarifies the distinction between people and backgrounds, thereby improving the text alignment. During inference, the token “[V]” is used if the text involves a person, “[W]” for the background, and “[V], [W]” if both are relevant. Figure 5 evaluates the effectiveness of each prompt component in separating people and backgrounds. Specifically, the prompt “[V(person)] style” is observed to generate images of a person in the target style, while the prompt “[W(background)] style” creates images of various backgrounds in the target style. Figure 1, Figure 7, Figure A4 and Figure A5 compare the image outputs of Multi-StyleForge and Single-StyleForge with various prompts, highlighting their effects on text alignment. By examining the texture of the six target styles, the detailed background representation, and the shapes of people and details of their faces in the generated images in these figures, it becomes evident that Multi-StyleForge excels at rendering these elements based on text descriptions. For instance, when evaluating the presence of “sunglasses” for the romanticism target style in the top row of Figure 1, such details are observed only in the images generated by Multi-StyleForge. As shown in Table 4, Multi-StyleForge significantly improves the text alignment compared to the other baselines, as evidenced by higher CLIP scores for all target styles except anime, where it achieves the second-highest score.
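The routing rule described above can be captured by a small helper. The sketch below is illustrative only; how person or background content is detected in a prompt (here passed in as flags) is left to the user.

```python
def build_prompt(description: str, has_person: bool, has_background: bool) -> str:
    """Prepend the Multi-StyleForge tokens according to the prompt content."""
    if has_person and has_background:
        tokens = "[V] style, [W] style"
    elif has_person:
        tokens = "[V] style"
    else:
        tokens = "[W] style"
    return f"a photo of {tokens}, {description}"

# e.g., build_prompt("a woman reading on a London street", True, True)
# -> "a photo of [V] style, [W] style, a woman reading on a London street"
```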

5.6. Comparison

Finally, we compared our Single- and Multi-StyleForge methods with the existing baseline methods. First, quantitative comparisons in terms of FID/KID and CLIP scores are provided in Table 4. We found that DreamBooth [14] generally performs well, but its effectiveness decreases for styles such as anime, cubism, and pixel art. For Textual Inversion [15], as the target style becomes less realistic, performance degrades significantly in terms of both FID/KID and CLIP scores. Across all target styles, Single-StyleForge demonstrated superior performance in terms of the FID/KID scores, followed by Multi-StyleForge. Regarding the CLIP scores, Multi-StyleForge achieved the best performance by improving the text–image alignment. Compared to Custom Diffusion [50] and LoRA [17], which use the concept of personalizing multiple subjects and/or parameter-efficient fine-tuning, it is evident that our Single/Multi-StyleForge methods using full fine-tuning perform better in learning the ambiguous subject “style”.
A qualitative comparison with the baselines is illustrated in Figure 7. Some methods often involve a trade-off between reflecting the target artistic style and aligning images with the text. In the case of pixel art, Textual Inversion and LoRA struggle to capture the target style fully. Furthermore, Textual Inversion (in cubism) and Custom Diffusion fail to depict details like “smiling”, with Custom Diffusion often showing a rear view instead. In contrast, our methods faithfully capture both the target style and the text in the generated images. More images generated using other text prompts are provided in Appendix A (see Figure A4 and Figure A5).

5.7. User Study

A user study with 20 participants evaluated our methods against baselines across four metrics—detail attribute reflection, background–person separation, human figure quality, and style consistency—using a 5-point Likert scale (Figure 8). Both Single-StyleForge and Multi-StyleForge consistently outperformed all baselines. Single-StyleForge achieved 35.8 46.5 % improvements, with the largest gain in style consistency ( 46.5 % ). Multi-StyleForge further improved the results, reaching 39.8 52.6 % , validating the effectiveness of dual-binding with multi-token decomposition. Improvements across all metrics (>35%) demonstrate that our approach enhances both the style fidelity and text alignment without trade-offs. The questionnaire is provided in Appendix A.

6. Conclusions

We presented StyleForge, a personalization framework that extends text-to-image synthesis to abstract artistic styles using only a small set of references. Single-StyleForge introduces a dual-binding mechanism that pairs a style token with target-style characteristics and an auxiliary token with curated human-centric images, stabilizing the rendering of people while mitigating overfitting. Multi-StyleForge further disentangles person and background attributes via multi-token decomposition, enabling fine-grained compositional control and improved text–image alignment. Across six styles, our experiments show that StyleForge achieves strong style fidelity and alignment with 15–20 references and without architectural changes to the base diffusion model.
Beyond creative media, these properties make StyleForge practical for smart-city content pipelines in which visual communication is central. Examples include rapidly producing style-consistent assets for urban dashboards and digital signage or adapting imagery to local cultural identities for tourism and citizen engagement, while preserving prompt controllability and reducing design turnaround.
The model’s robustness may degrade under domain shifts, such as night scenes, extreme weather, or densely crowded environments. Additionally, the use of human-centric auxiliary images necessitates careful curation to mitigate potential biases. Future work will focus on incorporating layout and geometric controls for complex urban scenarios, developing lightweight provenance mechanisms to support governance, and conducting human-in-the-loop evaluations to assess clarity and inclusivity in civic applications. In summary, StyleForge advances few-shot controllable style personalization for diffusion models and provides a practical pathway for integrating adaptive visual generation into AI-driven smart city services.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app151910623/s1.

Author Contributions

Conceptualization, J.P. and B.K.; methodology, J.P. and B.K.; software, J.P.; validation, J.P., B.K. and M.K.; formal analysis, B.K.; investigation, J.P., B.K., and M.K.; resources, H.J.; data curation, J.P.; writing—original draft preparation, J.P. and B.K.; writing—review and editing, H.J.; visualization, B.K.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2025-RS-2020-II201789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2025-RS-2023-00254592), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-24803248).

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://drive.google.com/drive/folders/1MZDv_NyBJm0x6RLWd2ILn0MybiU-D92S.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Training Step

Figure A2 highlights the importance of adjusting the training steps for each target style to optimize the text-to-image synthesis. Personalizing the base model (SD v1.5), which predominantly generates realistic-like art images, to target styles such as realism, romanticism, pixel art, and cubism is relatively straightforward and achieves optimal FID/KID scores with fewer training steps. In contrast, adapting it to target styles less aligned with the base model, such as midjourney and anime, requires extended iterations of 750 and 1000 training steps, respectively. Fewer training steps help mitigate overfitting, maintaining CLIP scores while preserving the text–image alignment. Our approach, which focuses on personalizing styles, generally requires more training steps than object-based personalization methods, e.g., DreamBooth [14]. However, it is observed that increasing the number of training steps does not universally enhance the style representation, as StyleRef images x encompass a limited style range and may neglect many visual attributes.
Figure A1. Results of different choices of Aux images x^aux: (top) images created from the frozen model using the “style” token; (middle) images created using the “illustration style” token; and (bottom) images drawn by a person.
The training step influences both the denoising U-Net and the text encoder, which manages the text conditions in the diffusion process. Since multiple StyleRef prompts are used in Multi-StyleForge, more training steps are needed for the text encoder to effectively construct the latent space. This is illustrated in the right panel of Figure A2. In the experiment, FID/KID scores and CLIP scores were compared across different training steps, starting with the same number of steps as Single-StyleForge and increasing by 2.5 times. Using the same training steps as Single-StyleForge results in suboptimal FID/KID scores due to insufficient learning of multiple StyleRef prompts, leading to ineffective style personalization. Generally, doubling the training steps shows optimal performance, but further increases lead to overfitting, resulting in a decrease in overall evaluation metrics.
Figure A2. (left: Single-StyleForge) FID, KID (×10³), and CLIP scores of generated images as a function of fine-tuning steps for different target styles using only StyleRef images. The best FID scores are achieved at 500, 750, and 1000 steps for the realism, midjourney, and anime styles, respectively. The best KID scores are achieved at 500, 250, and 250 steps for the romanticism, cubism, and pixel art styles, respectively. (right: Multi-StyleForge) The best KID/FID scores of Multi-StyleForge are achieved by doubling the training steps of Single-StyleForge.

Appendix A.2. Training Strategy Comparison

To further analyze the training design in Algorithm 2, we compared the proposed parallel training with a sequential training baseline. In sequential training, D 1 is fully learned first, and then D 2 is trained independently, which often causes catastrophic forgetting and leads to interference between person and background components. In contrast, parallel training selects one dataset at each step, ensuring balanced updates across components. As shown in Table A1, parallel training achieves lower FID and higher CLIP scores in all evaluated styles, whereas sequential training exhibits performance degradation due to the loss of distinctive information. These results provide empirical evidence that the probabilistic separation adopted in parallel training effectively mitigates information conflict and enhances both fidelity and text–image alignment.
Table A1. Comparison of sequential and parallel training strategies for Multi-StyleForge. Lower FID and higher CLIP indicate better performance. Bold indicates the best results.

Strategy | Realism FID↓ | Realism CLIP↑ | Midjourney FID↓ | Midjourney CLIP↑ | Anime FID↓ | Anime CLIP↑
sequential training | 15.24 | 29.85 | 14.92 | 30.10 | 22.73 | 27.45
parallel training (ours) | 13.48 | 31.21 | 12.76 | 32.08 | 20.88 | 28.66

Appendix A.3. Auxiliary Image

Figure A1 shows examples of each Aux image configuration. The first and second rows visualize the inherent information in the “style” and “illustration style” tokens of the pre-trained text-to-image diffusion model. The “style” token primarily contains semantic information related to fashion styles, while the “illustration style” token includes random artworks or textile patterns, showing significant differences compared to the human-drawn art images in the third row. This highlights the challenges of style personalization compared to object personalization. In object personalization, such as DreamBooth [14], regularization images, which play a role similar to that of our auxiliary images, provide clear and distinct support. For instance, if the meta-class name were “dog”, the regularization images would directly reinforce the capability of producing images of dogs. However, when uncertain and less-specific auxiliary images (e.g., those associated with the “style” token in Figure A1) are used in training, the unrefined information can interfere as negative auxiliary guidance in the learning of StyleRef images.

Data Generation Details

For the midjourney style, we strategically utilized different versions of MidJourney [61] to create a comprehensive dataset. StyleRef images were generated using MidJourney v3, which produces distinctive artistic outputs characterized by dramatic expressions and exaggerated forms. The v3 model’s inherently stylized and artistically enhanced outputs define what we refer to as the “midjourney style” in this work. In contrast, auxiliary images were generated using MidJourney v5.2, which excels at producing refined and anatomically accurate human representations in the form of digital painting images. This deliberate version difference serves our dual-binding mechanism: v3’s characteristic artistic stylization enables the “[V]” token to capture the distinctive midjourney aesthetic, while v5.2’s digital painting images provide reliable auxiliary guidance for maintaining proper human attributes. This approach demonstrates how different model versions can be strategically leveraged to enhance style personalization performance.

Appendix A.4. CLIP-Based Analysis of Auxiliary Image Selection

To quantitatively evaluate the affinity between the StyleRef and Auxiliary sets, we adopted a CLIP-based [67] similarity metric. Let f(·) denote the CLIP image encoder. Given StyleRef images {x_i}_{i=1}^{N} and Auxiliary images {x_j^aux}_{j=1}^{M}, our score is the mean cosine similarity over all pairs:
\mathrm{Similarity}(\mathrm{StyleRef}, \mathrm{Aux}) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \cos\left( f(x_i), f(x_j^{\mathrm{aux}}) \right),
where cos ( · , · ) denotes the cosine similarity between normalized CLIP embeddings. We evaluate this score for three Auxiliary configurations: Ours, “style” token (Figure A1), and human-drawn art (Figure A1). As shown in Table A2, the curated Auxiliary consistently achieves the highest similarity across all styles, while the human-drawn art set shows the weakest affinity. These results demonstrate that our Auxiliary selection is systematic rather than ad hoc, and they align with the improvements in text–image alignment (CLIP) observed when training with the curated Auxiliary. In particular, the curated Auxiliary surpasses the style token by + 0.029 (roman), + 0.027 (cubism), and + 0.029 (pixel art) and outperforms the human-drawn art set by + 0.063 , + 0.062 , and + 0.063 , respectively, confirming that the curated distribution is more closely aligned with the target styles.
Table A2. Mean CLIP (ViT-L/14) cosine similarity between StyleRef images and candidate Auxiliary sets for each target style (higher is better). Bold indicates the best results.

Auxiliary Set | Roman | Realism | Anime | Midjourney | Cubism | Pixel Art
Style token [14] | 0.532 | 0.520 | 0.527 | 0.535 | 0.517 | 0.523
Human-drawn art | 0.498 | 0.486 | 0.490 | 0.493 | 0.482 | 0.489
Ours | 0.561 | 0.546 | 0.555 | 0.568 | 0.544 | 0.552
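The similarity score in (A1) can be computed with off-the-shelf CLIP embeddings. The sketch below uses the transformers implementation of CLIP ViT-L/14, matching Table A2; the loading and preprocessing details are our assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed(paths):
    """Encode a list of image files into L2-normalized CLIP embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def styleref_aux_similarity(styleref_paths, aux_paths):
    """Mean pairwise cosine similarity between StyleRef and Auxiliary embeddings."""
    s, a = embed(styleref_paths), embed(aux_paths)
    return (s @ a.T).mean().item()  # dot product of normalized vectors = cosine
```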

Appendix A.5. Qualitative Comparison with Baseline Methods

Figure A4 and Figure A5, along with Figure 7, provide additional qualitative comparison results. Figure 7 illustrates images generated with prompts involving both people and backgrounds, Figure A4 focuses on prompts solely related to people, and Figure A5 focuses on background-related prompts. These comparisons allow us to assess the performance of our method in various scenarios, specifically verifying how effectively our model reflects both style and text in complex descriptions involving people and backgrounds.
In Figure A5, we found that all methods, including ours, qualitatively reflect the art style and text well for background-related prompts. However, as shown in Figure A4, the baseline methods show a trade-off between capturing the style and aligning with the text for content involving cognitively salient elements, such as people, where unnaturalness can cause significant errors. For example, Textual Inversion fails to capture the typical visual features of the anime style, e.g., vibrant colors and exaggerated facial expressions. Moreover, it is observed that most models, including Single-StyleForge, struggle to reflect some detailed text descriptions such as “tan skin”. Conversely, Multi-StyleForge, which enhances the text–image alignment, achieves superior visual reflection, effectively balancing the style representation and detailed text descriptions. Furthermore, we also conducted a performance evaluation with innovative and future-oriented prompts related to smart cities, as presented in Figure A6.
Figure A3. Results of transforming input images (the leftmost) using Single-StyleForge. Output images were created with the prompts “a photo of [V] style, a Santa Claus” (first row) and “a photo of [V], a man” (second row). Single-StyleForge synthesizes images that accurately reflect artistic styles, even when various forms of input images, including Santa Claus toys and watercolor brush paintings, are used.
Figure A4. Comparison of our methods to existing personalization techniques. The images are guided by a prompt related to people: “a photo of [V] style, a woman with tan skin in blue jeans and yellow shirt”. Our models perform the desired synthesis by reflecting artistic styles and text, including detailed descriptions like “tan skin”.
Figure A5. Comparison of our methods to existing personalization techniques. The images are guided by prompts related to the backgrounds.
Figure A6. Comparison of our methods to existing personalization techniques. The images are guided by a prompt related to a smart city.

Appendix A.6. Applications

Here, we demonstrate an interesting application of our approach. With the ability to incorporate specific styles into arbitrary images, our method leverages techniques from SDEdit [69] and YOLO-Chicken [70] to easily transform input images into specific artistic styles. The model iteratively denoises noisy input images (the leftmost images of Figure A3) toward the style distribution learned during training. Therefore, users only need to provide an image and a simple text prompt, without requiring any particular artistic expertise or effort. Figure A3 showcases the results of styling input images with our Single-StyleForge method.
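A hedged sketch of this SDEdit-style application using the diffusers image-to-image pipeline is given below; the checkpoint path, input image, and strength value are placeholders rather than the exact settings used for Figure A3.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a Single-StyleForge fine-tuned checkpoint (path is a placeholder).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "./single-styleforge-anime", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("santa_claus.jpg").convert("RGB").resize((512, 512))

# SDEdit-style editing: partially noise the input and denoise it toward the learned style.
# 'strength' controls how much of the original structure is kept (lower = closer to input).
result = pipe(
    prompt="a photo of [V] style, a Santa Claus",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
result.save("santa_claus_styled.png")
```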

Appendix A.7. User Study Questionnaire

We recruited 20 participants, each of whom evaluated six models across six artistic styles and three questions, resulting in a total of 108 independent evaluations. To ensure unbiased assessment, model labels were randomized for each participant (Models A–F). The evaluated models were DreamBooth, Textual Inversion, LoRA, Custom Diffusion, Single-StyleForge, and Multi-StyleForge. Participants rated each image on a 5-point Likert scale ( 1 = Poor, 5 = Excellent) across four dimensions: Detail Attribute Reflection, Background–Person Separation, Human Quality (excluded for background-only prompts), and Style Consistency.
Figure A7. Example image used in the user study.

References

  1. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar]
  2. Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  3. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  4. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  5. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  6. Sehwag, V.; Kong, X.; Li, J.; Spranger, M.; Lyu, L. Stretching each dollar: Diffusion training from scratch on a micro-budget. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 28596–28608. [Google Scholar]
  7. Jiang, D.; Song, G.; Wu, X.; Zhang, R.; Shen, D.; Zong, Z.; Liu, Y.; Li, H. CoMat: Aligning text-to-image diffusion model with image-to-text concept matching. Adv. Neural Inf. Process. Syst. 2024, 37, 76177–76209. [Google Scholar]
  8. Liu, J.; Li, C.; Sun, Q.; Ming, J.; Fang, C.; Wang, J.; Zeng, B.; Liu, S. Ada-adapter: Fast few-shot style personalization of diffusion model with pre-trained image encoder. arXiv 2024, arXiv:2407.05552. [Google Scholar]
  9. Song, N.; Yang, X.; Yang, Z.; Lin, G. Towards lifelong few-shot customization of text-to-image diffusion. arXiv 2024, arXiv:2411.05544. [Google Scholar]
  10. Alshahrani, A. Bridging Cities and Citizens with Generative AI: Public Readiness and Trust in Urban Planning. Buildings 2025, 15, 2494. [Google Scholar] [CrossRef]
  11. Liu, Z.; He, Y.; Demian, P.; Osmani, M. Immersive technology and building information modeling (BIM) for sustainable smart cities. Buildings 2024, 14, 1765. [Google Scholar] [CrossRef]
  12. del Campo, G.; Saavedra, E.; Piovano, L.; Luque, F.; Santamaria, A. Virtual Reality and Internet of Things Based Digital Twin for Smart City Cross-Domain Interoperability. Appl. Sci. 2024, 14, 2747. [Google Scholar] [CrossRef]
  13. De Silva, D.; Mills, N.; Moraliyage, H.; Rathnayaka, P.; Wishart, S.; Jennings, A. Responsible artificial intelligence hyper-automation with generative AI agents for sustainable cities of the future. Smart Cities 2025, 8, 34. [Google Scholar] [CrossRef]
  14. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  15. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  16. Kim, M.; Yoo, J.; Kwon, S. Personalized text-to-image model enhancement strategies: Sod preprocessing and cnn local feature integration. Electronics 2023, 12, 4707. [Google Scholar] [CrossRef]
  17. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  18. Chang, H.; Zhang, H.; Barber, J.; Maschinot, A.; Lezama, J.; Jiang, L.; Yang, M.H.; Murphy, K.; Freeman, W.T.; Rubinstein, M.; et al. Muse: Text-To-Image Generation via Masked Generative Transformers. arXiv 2023, arXiv:2301.00704. [Google Scholar]
  19. Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789. [Google Scholar]
  20. Tewel, Y.; Kaduri, O.; Gal, R.; Kasten, Y.; Wolf, L.; Chechik, G.; Atzmon, Y. Training-free consistent text-to-image generation. ACM Trans. Graph. 2024, 43, 52. [Google Scholar] [CrossRef]
  21. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  22. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 4401–4410. [Google Scholar]
  23. Li, R. Image Style Transfer with Generative Adversarial Networks. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, New York, NY, USA, 20–24 October 2021; pp. 2950–2954. [Google Scholar] [CrossRef]
  24. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  25. Way, D.L.; Chang, W.C.; Shih, Z.C. Deep Learning for Anime Style Transfer. In Proceedings of the 2019 3rd International Conference on Advances in Image Processing, ICAIP ’19, Chengdu, China, 8–10 November 2019; pp. 139–143. [Google Scholar] [CrossRef]
  26. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  27. Liu, S.; Lin, T.; He, D.; Li, F.; Wang, M.; Li, X.; Sun, Z.; Li, Q.; Ding, E. Adaattn: Revisit attention mechanism in arbitrary neural style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6649–6658. [Google Scholar]
  28. Ma, Y.; Zhao, C.; Huang, B.; Li, X.; Basu, A. RAST: Restorable Arbitrary Style Transfer. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 143. [Google Scholar] [CrossRef]
  29. Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2085–2094. [Google Scholar]
  30. Chen, Y.; Zhou, H.; Chen, J.; Yang, N.; Zhao, J.; Chao, Y. Diffusion Model-Based Cartoon Style Transfer for Real-World 3D Scenes. ISPRS Int. J. Geo-Inf. 2025, 14, 303. [Google Scholar] [CrossRef]
  31. Han, X.; Wu, Y.; Wan, R. A method for style transfer from artistic images based on depth extraction generative adversarial network. Appl. Sci. 2023, 13, 867. [Google Scholar] [CrossRef]
  32. Su, N.; Wang, J.; Pan, Y. Multi-Scale Universal Style-Transfer Network Based on Diffusion Model. Algorithms 2025, 18, 481. [Google Scholar] [CrossRef]
  33. Xiang, Z.; Wan, X.; Xu, L.; Yu, X.; Mao, Y. A Training-Free Latent Diffusion Style Transfer Method. Information 2024, 15, 588. [Google Scholar] [CrossRef]
  34. Yang, H.; Yang, H.; Min, K. Artfusion: A Diffusion Model-Based Style Synthesis Framework for Portraits. Electronics 2024, 13, 509. [Google Scholar] [CrossRef]
  35. Wang, Z.; Zhao, L.; Xing, W. StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7677–7689. [Google Scholar]
  36. Hamazaspyan, M.; Navasardyan, S. Diffusion-Enhanced PatchMatch: A Framework for Arbitrary Style Transfer With Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 797–805. [Google Scholar]
  37. Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.; Xu, C. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10146–10156. [Google Scholar]
  38. Ahn, N.; Lee, J.; Lee, C.; Kim, K.; Kim, D.; Nam, S.H.; Hong, K. DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models. arXiv 2023, arXiv:2309.06933. [Google Scholar] [CrossRef]
  39. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  40. Li, H.; Liu, Y.; Liu, C.; Pang, H.; Xu, K. A Few-Shot Steel Surface Defect Generation Method Based on Diffusion Models. Sensors 2025, 25, 3038. [Google Scholar] [CrossRef] [PubMed]
  41. Martini, L.; Iacono, S.; Zolezzi, D.; Vercelli, G.V. Advancing Persistent Character Generation: Comparative Analysis of Fine-Tuning Techniques for Diffusion Models. AI 2024, 5, 1779–1792. [Google Scholar] [CrossRef]
  42. Alaluf, Y.; Richardson, E.; Metzer, G.; Cohen-Or, D. A neural space-time representation for text-to-image personalization. ACM Trans. Graph. 2023, 42, 243. [Google Scholar] [CrossRef]
  43. Park, J.; Ko, B.; Jang, H. StyleBoost: A Study of Personalizing Text-to-Image Generation in Any Style using DreamBooth. In Proceedings of the 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 11–14 October 2023; pp. 93–98. [Google Scholar] [CrossRef]
  44. Dong, Z.; Wei, P.; Lin, L. DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning. arXiv 2023, arXiv:2211.11337. [Google Scholar]
  45. Lu, H.; Tunanyan, H.; Wang, K.; Navasardyan, S.; Wang, Z.; Shi, H. Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14267–14276. [Google Scholar]
  46. Sohn, K.; Ruiz, N.; Lee, K.; Chin, D.C.; Blok, I.; Chang, H.; Barber, J.; Jiang, L.; Entis, G.; Li, Y.; et al. StyleDrop: Text-to-Image Generation in Any Style. arXiv 2023, arXiv:2306.00983. [Google Scholar]
  47. Ruiz, N.; Li, Y.; Jampani, V.; Wei, W.; Hou, T.; Pritch, Y.; Wadhwa, N.; Rubinstein, M.; Aberman, K. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv 2023, arXiv:2307.06949. [Google Scholar]
  48. Gal, R.; Arar, M.; Atzmon, Y.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Trans. Graph. (TOG) 2023, 42, 150. [Google Scholar] [CrossRef]
  49. Arar, M.; Gal, R.; Atzmon, Y.; Chechik, G.; Cohen-Or, D.; Shamir, A.; Bermano, A.H. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. In Proceedings of the SIGGRAPH Asia 2023 Conference Papers, Sydney, NSW, Australia, 12–15 December 2023; pp. 1–10. [Google Scholar]
  50. Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.Y. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1931–1941. [Google Scholar]
  51. Han, L.; Li, Y.; Zhang, H.; Milanfar, P.; Metaxas, D.; Yang, F. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. arXiv 2023, arXiv:2303.11305. [Google Scholar] [CrossRef]
  52. Ma, J.; Liang, J.; Chen, C.; Lu, H. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Tokyo, Japan, 3–6 December 2024; pp. 1–12. [Google Scholar]
  53. Tewel, Y.; Gal, R.; Chechik, G.; Atzmon, Y. Key-locked rank one editing for text-to-image personalization. In Proceedings of the ACM SIGGRAPH 2023 Conference, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–11. [Google Scholar]
  54. Avrahami, O.; Aberman, K.; Fried, O.; Cohen-Or, D.; Lischinski, D. Break-a-scene: Extracting multiple concepts from a single image. In Proceedings of the SIGGRAPH Asia 2023 Conference Papers, Sydney, NSW, Australia, 12–15 December 2023; pp. 1–12. [Google Scholar]
  55. Zhang, L.; Agrawala, M. Adding conditional control to text-to-image diffusion models. arXiv 2023, arXiv:2302.05543. [Google Scholar]
  56. Tang, R.; Liu, L.; Pandey, A.; Jiang, Z.; Yang, G.; Kumar, K.; Stenetorp, P.; Lin, J.; Ture, F. What the DAAM: Interpreting stable diffusion using cross attention. arXiv 2022, arXiv:2210.04885. [Google Scholar] [CrossRef]
  57. Park, J.; Jang, H. I2AM: Interpreting Image-to-Image Latent Diffusion Models via Attribution Maps. arXiv 2024, arXiv:2407.12331. [Google Scholar]
  58. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv 2015, arXiv:1503.03585. [Google Scholar] [CrossRef]
  59. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  60. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. arXiv 2022, arXiv:2010.02502. [Google Scholar] [CrossRef]
  61. MidJourney. Available online: https://www.midjourney.com/ (accessed on 20 September 2025).
  62. Hugging Face. Available online: https://huggingface.co (accessed on 20 September 2025).
  63. WikiArt. Available online: https://www.wikiart.org/ (accessed on 20 September 2025).
  64. Pixel-Art. Available online: https://www.kaggle.com/datasets (accessed on 20 September 2025).
  65. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  66. Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. arXiv 2021, arXiv:1801.01401. [Google Scholar] [CrossRef]
  67. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  68. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv 2022, arXiv:2208.01626. [Google Scholar]
  69. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv 2021, arXiv:2108.01073. [Google Scholar]
  70. Jiang, D.; Wang, H.; Li, T.; Gouda, M.A.; Zhou, B. Real-time tracker of chicken for poultry based on attention mechanism-enhanced YOLO-Chicken algorithm. Comput. Electron. Agric. 2025, 237, 110640. [Google Scholar] [CrossRef]
Figure 1. Text-to-image synthesis via Single/Multi-StyleForge personalized in various art styles, from realism to pixel art. The generated images demonstrate our approach’s ability to create aligned and high-fidelity images in each target style (top row) by using a unique token (“[V] style”) in text prompts.
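Figure 1 illustrates the prompt convention with the unique token “[V] style”. As a minimal usage sketch (the checkpoint path and sampler settings below are placeholders, not the authors' released configuration), sampling from a StyleForge-personalized model would look as follows:

```python
# Minimal usage sketch: text-to-image sampling with a StyleForge-personalized
# Stable Diffusion checkpoint. The model path is a hypothetical placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/styleforge-anime", torch_dtype=torch.float16
).to("cuda")

# "[V]" stands for the rare identifier token actually bound during fine-tuning.
prompt = "a photo of [V] style, a woman with tan skin in blue jeans and yellow shirt"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("styleforge_sample.png")
```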
Figure 2. The architecture of Single-StyleForge. StyleRef images of the target style, paired with the text prompt “a photo of [V] style”, and Aux images, paired with the prompt “a photo of style”, are provided as input images. After fine-tuning, the text-to-image model can generate diverse images in the target style under the guidance of text prompts.
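The dual binding in Figure 2 can be written out as a loss sketch, under the assumption of a DreamBooth-style prior-preservation objective [14]; the auxiliary weight λ is notation introduced here for illustration, not a reported hyperparameter:

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x,\,\epsilon,\,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\mathrm{style}}) \rVert_2^2\big]
  + \lambda\,\mathbb{E}_{x^{\mathrm{aux}},\,\epsilon',\,t'}\big[\lVert \epsilon' - \epsilon_\theta(x^{\mathrm{aux}}_{t'}, t', c_{\mathrm{aux}}) \rVert_2^2\big]
```

Here x_t is a noised StyleRef image at timestep t, c_style is the text embedding of “a photo of [V] style”, and c_aux is the embedding of “a photo of style” used for the Aux images.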
Figure 3. Comparison of different compositions of StyleRef images. In (a,b), StyleRef images consisting of only background or person, respectively, show that the target style is learned based on biased information, failing to include a girl in (a). The generated images in (c) closely align with the prompts.
Figure 4. Attention maps for the “[V]” and “style” tokens in the prompt. As designed, “[V]” attends to a relatively large portion of the image, while “style” attends mainly to people. The maps were produced with a modified Prompt-to-Prompt [68].
Figure 5. Images generated with each of the tokens “[V(person)] style”, “[W(background)] style”, and “style” from the model trained with Multi-StyleForge. Images in the (top) and (middle) rows contain only the person and the background, respectively, while images in the (bottom) row show the person in an auxiliary style.
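Figure 5 suggests how Multi-StyleForge pairs prompts with subsets of the training images; a hypothetical sketch of this pairing (the function name and prompt templates are assumptions inferred from the captions of Figures 2 and 5, not released code) is:

```python
# Hypothetical prompt assignment for Multi-StyleForge training pairs,
# inferred from the captions of Figures 2 and 5 (not the authors' code).
from typing import List, Tuple

def build_training_pairs(
    person_imgs: List[str], background_imgs: List[str], aux_imgs: List[str]
) -> List[Tuple[str, str]]:
    pairs = [(p, "a photo of [V] style") for p in person_imgs]       # person token
    pairs += [(b, "a photo of [W] style") for b in background_imgs]  # background token
    pairs += [(a, "a photo of style") for a in aux_imgs]             # auxiliary binding
    return pairs

# Example with tiny lists of (hypothetical) image paths.
for img, prompt in build_training_pairs(["p1.png"], ["b1.png"], ["a1.png"]):
    print(img, "->", prompt)
```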
Figure 6. Ablation study of Aux images x_aux for six target styles, displaying FID, KID (×10³), and CLIP scores.
Figure 7. Comparison of our methods to existing personalization techniques. The images are guided by prompts related to humans and backgrounds.
Figure 8. Human evaluation results from 20 participants comparing baseline methods with our proposed StyleForge approaches across four metrics: detail attribute reflection, background–person separation, human figure quality, and style consistency. All metrics use a 5-point Likert scale. Both Single-StyleForge and Multi-StyleForge significantly outperform all baselines.
Table 1. Details of baseline methods in terms of the fine-tuning method and the use of StyleRef and Aux images. Full and partial tuning indicate fine-tuning the entire pre-trained model and a subset of its parameters, respectively.

| | DreamBooth | Textual Inversion | LoRA | Custom Diffusion | Single-StyleForge | Multi-StyleForge |
|---|---|---|---|---|---|---|
| Tuning method | Full | Partial | Partial | Partial | Full | Full |
| StyleRef image | | | | | | |
| Aux image | | | | | | |
Table 2. FID and KID (×10³) scores with different compositions of StyleRef images x. Bold indicates the best results.

| StyleRef Images | FID (↓) Realism | FID (↓) Midjourney | FID (↓) Anime | KID (↓) Romanticism | KID (↓) Cubism | KID (↓) Pixel Art |
|---|---|---|---|---|---|---|
| only backgrounds | 22.804 | 24.598 | 34.629 | – | – | – |
| only persons | 21.708 | 18.812 | 47.588 | – | – | – |
| mix of backgrounds + persons | 15.196 | 15.449 | 22.227 | 2.022 | 2.257 | 0.714 |
Table 3. Comparison of FID and KID (×10³) with different compositions of Aux images x_aux, where we chose the StyleRef composition of backgrounds + persons. Bold indicates the best results.

| Aux Images | FID (↓) Realism | FID (↓) Midjourney | FID (↓) Anime | KID (↓) Roman. | KID (↓) Cubism | KID (↓) Pixel-Art |
|---|---|---|---|---|---|---|
| Style token [14] | 14.297 | 14.293 | 31.518 | 1.999 | 3.646 | 0.843 |
| Illustration style token [14] | 14.093 | 14.466 | 28.570 | – | – | – |
| Human-drawn art | 14.263 | 16.366 | 22.836 | – | – | – |
| Target style | 15.855 | 13.990 | 29.450 | – | – | – |
| Single-StyleForge (ours) | 13.008 | 12.222 | 20.718 | 1.602 | 1.349 | 0.704 |
Table 4. Quantitative comparisons with FID, KID (×10³), and CLIP scores. The table presents FID scores for realism, midjourney, and anime styles, along with KID scores for romanticism, cubism, and pixel art styles, and CLIP scores for all styles. The best and second-best results are indicated in bold and underline, respectively.

| Method | FID (↓) Realism | FID (↓) Midjourney | FID (↓) Anime | KID (↓) Roman. | KID (↓) Cubism | KID (↓) Pixel-Art | CLIP (↑) Realism | CLIP (↑) Midjourney | CLIP (↑) Anime | CLIP (↑) Roman. | CLIP (↑) Cubism | CLIP (↑) Pixel-Art |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DreamBooth [14] | 14.093 | 14.293 | 28.570 | 1.999 | 3.646 | 0.843 | 28.226 | 29.020 | 28.551 | 27.420 | 27.818 | 26.175 |
| Textual Inversion [15] | 17.048 | 22.797 | 41.654 | 6.113 | 4.783 | 2.330 | 28.227 | 27.063 | 26.284 | 25.482 | 22.984 | 26.497 |
| LoRA [17] | 13.218 | 16.247 | 24.560 | 8.664 | 13.183 | 2.641 | 28.926 | 29.406 | 29.015 | 29.074 | 28.188 | 29.534 |
| Custom Diffusion [50] | 21.906 | 20.227 | 35.948 | 7.544 | 6.680 | 2.481 | 28.253 | 29.012 | 28.246 | 26.452 | 27.395 | 25.424 |
| Single-StyleForge (ours) | 13.008 | 12.222 | 20.718 | 1.602 | 1.349 | 0.704 | 28.761 | 28.616 | 27.551 | 27.488 | 27.304 | 26.719 |
| Multi-StyleForge (ours) | 13.480 | 12.764 | 20.880 | 1.912 | 1.820 | 1.216 | 31.215 | 32.082 | 28.662 | 31.243 | 30.891 | 29.852 |
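For reference, the sketch below shows one way to compute FID, KID, and CLIP scores with the torchmetrics library; the random tensors, sample counts, and CLIP backbone are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
# Illustrative computation of FID, KID, and CLIP score with torchmetrics.
# Random uint8 tensors stand in for real/generated images; a real evaluation
# would use many more samples, drawn per target style.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

kid = KernelInceptionDistance(subset_size=32)
kid.update(real_images, real=True)
kid.update(fake_images, real=False)
kid_mean, _ = kid.compute()
print("KID (x10^3):", 1000 * kid_mean.item())

# CLIP score between generated images and their text prompts.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["a photo of [V] style, a woman in blue jeans"] * fake_images.shape[0]
print("CLIP score:", clip(fake_images, prompts).item())
```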