Article

A Style-Adapted Virtual Try-On Technique for Story Visualization

1
Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
2
Department of Software, Sangmyung University, Cheonan 31066, Republic of Korea
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2026, 15(3), 514; https://doi.org/10.3390/electronics15030514
Submission received: 14 December 2025 / Revised: 14 January 2026 / Accepted: 23 January 2026 / Published: 25 January 2026
(This article belongs to the Special Issue Application of Machine Learning in Graphics and Images, 2nd Edition)

Abstract

We propose a novel clothing application technique designed for story visualization frameworks, in which various characters appear wearing a wide range of outfits. To achieve this goal, we extend a Virtual Try-On framework for synthetic garment fitting. Conventional Virtual Try-On methods are limited to generating images of a single person wearing a restricted set of clothes within a fixed style domain. To overcome these limitations, we apply an improved Virtual Try-On model trained on appropriately processed datasets, enabling the generation of upper and lower garments separately across diverse characters and producing images in four distinct styles: photorealistic, webtoon, animation, and watercolor. Our system collects character and clothing images, performs accurate masking of garment regions, and takes a style-specific text prompt as input. Based on these inputs, garment-specific conditioning is applied to synthesize the clothing, followed by a cross-style diffusion process that generates Virtual Try-On images reflecting multiple visual styles. Our approach significantly enhances the adaptability and stylistic diversity of Virtual Try-On technology for story visualization applications.

1. Introduction

Recent advances in generative artificial intelligence have transformed character creation and visual representation across a wide range of visual content production domains, including webtoons, animation, games, and video media. In particular, diffusion-based image generation models [1,2,3] have emerged as powerful tools capable of producing high-quality visual results. Building on these models, Virtual Try-On (VTON) techniques have been actively studied for garment synthesis and human image generation. Despite this technological progress, however, existing VTON frameworks remain limited in their ability to adequately support content production scenarios such as story visualization, in which the same character repeatedly appears wearing diverse outfits and exhibiting multiple visual styles.
Most existing VTON studies [4,5,6,7,8,9,10,11] have been developed primarily within the photorealistic domain, focusing on synthesizing garments for a single person under a fixed visual style. While these approaches have significantly improved garment alignment and visual realism, their extension to non-photorealistic style domains, such as animation, webtoon, and watercolor, has been relatively limited. Moreover, many existing frameworks fail to provide sufficient fine-grained garment controllability, such as independent control of upper and lower garments or partial garment replacement. In addition, style transformation, human structure preservation, and garment alignment are often handled in separate stages or independent modules, leading to fragmented pipelines. These structural limitations reduce visual consistency and practical usability in story-based visualization environments where diverse characters and styles coexist.
Virtual Try-On for story visualization goes beyond a simple image synthesis problem and simultaneously requires preservation of human structural identity, retention of garment-specific texture and shape, and consistent visual transformation across different style domains. However, in prior studies, style information is typically expressed at the text-prompt level or applied as a post-processing step, while garment information is often handled through warping operations or latent concatenation. As a result, methodological approaches that systematically disentangle and integrate person, garment, and style information as independent yet interacting conditioning signals within the attention structure of diffusion models have not been sufficiently explored. This gap highlights a critical technical limitation in supporting stable and controllable combinations of diverse visual styles and garments.
To address these limitations, our work reformulates the Virtual Try-On problem as a multi-domain conditional diffusion generation problem for story visualization. We propose StyleVTON, a unified framework composed of three specialized modules: OmniNet, which captures human structure and pose; GarmentNet, which extracts garment texture and shape; and StyleNet, which controls visual style domains. Our framework separates person, garment, and style information into independent conditioning signals and integrates them into the self-attention and cross-attention layers of a diffusion model. This design enables the stable generation of diverse character–garment–style combinations within a single network. Unlike conventional step-wise pipelines or simple conditional fusion strategies, our architecture is explicitly designed to ensure the visual consistency and controllability required in story visualization scenarios.
Our study aims to answer the following key research question: Can a single diffusion-based Virtual Try-On framework preserve human structural identity and garment-specific visual characteristics while consistently synthesizing garments across both photorealistic and diverse non-photorealistic visual styles (e.g., animation, webtoon, and watercolor)? Furthermore, we experimentally investigate whether modeling person, garment, and style information as functionally disentangled conditioning signals and integrating them within the attention structure of diffusion models can effectively enhance visual consistency and controllability compared to existing VTON methods in story visualization scenarios.
The main contributions of this paper can be summarized as follows. First, we formalize the Virtual Try-On problem for story visualization as a conditional generation task that simultaneously requires style invariance and garment sensitivity. Second, we propose a novel diffusion-based framework that integrates person, garment, and style information as functionally disentangled conditioning signals within the attention mechanism, offering methodological insights beyond simple module composition. Third, by combining style-token-based conditioning with garment-specific attention, we experimentally demonstrate that our approach maintains structural consistency and garment fidelity not only in photorealistic domains but also across diverse artistic styles such as animation, webtoon, and watercolor. Through these contributions, this work presents a scalable and practical diffusion-based Virtual Try-On framework tailored for story-driven visual content production.
This paper is organized as follows: Section 2 reviews related work on Virtual Try-On and style transfer techniques, highlighting the limitations of current approaches. Section 3 outlines our method. Section 4 details our style-adapted VTON framework, including the formal definition of the dual-path architecture and the specific training strategy. Section 5 presents the experimental setup and dataset construction, and a comprehensive analysis of the quantitative and qualitative results compared to state-of-the-art methods is presented in Section 6. Finally, Section 7 concludes the paper with a summary of our findings and discussions of future research directions.

2. Related Work

2.1. Early VTON Techniques for Garment Alignment

Early Virtual Try-On (VTON) studies primarily focused on the problem of garment–human alignment as a core challenge, and proposed image warping-based or GAN-based approaches to address it. This line of research played a crucial role in formally defining the VTON problem and laid the foundation for a wide range of subsequent extensions.
VITON [12], one of the representative early works, employed Thin-Plate Spline (TPS) transformations to deform garments according to the target human silhouette, followed by a GAN-based synthesis network to generate the final image. CP-VTON [13] later introduced a GMM-based warping module to better preserve garment patterns, while Distilling Appearance Flows [14] reduced dependency on human parsing by learning appearance flow in a teacher–student manner, aiming for more stable alignment. Although these warping- or flow-based methods improved garment alignment accuracy, they still suffered from misalignment issues under complex pose variations, garment wrinkles, and occlusions.
Ge et al. [15] presented a follow-up study that simultaneously pursued human structure preservation and visual realism by leveraging disentangled representations and cycle consistency. Parser-Free VTON [16] alleviated error propagation caused by inaccurate human parsing, while Street TryOn [17] targeted generalization in in-the-wild environments by accommodating diverse backgrounds and poses. Despite extending the applicability of VTON systems, these methods still relied on multi-stage pipelines, which inevitably led to accumulated errors and training instability.
To further improve generalization, GP-VTON [18] proposed a framework that combines global parsing with local garment alignment, enabling relatively stable synthesis under complex pose variations and diverse human conditions. While GP-VTON expanded the applicability of warping-based approaches, it still depended on external alignment modules and multi-stage pipelines, and therefore could not fundamentally resolve error accumulation and identity drift.
High-resolution synthesis-oriented approaches such as VITON-HD [19], as well as High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions [20], extended conventional warping/GAN-based frameworks to high-resolution settings. Although these methods improved visual quality through normalization techniques and conditional processing, their reliance on external alignment modules remained a structural limitation. Similarly, Dress Code [21] expanded fine-grained garment control by supporting multiple garment categories, yet could not fully guarantee alignment stability for complex outfit combinations.
Overall, early VTON techniques handled garment alignment using image warping or GAN-based modules, which introduced inherent limitations such as accumulated alignment errors, identity drift, and increased pipeline complexity. These issues become particularly pronounced in story visualization scenarios, where diverse garments and poses are required and the same character repeatedly appears. Consequently, there is a growing demand for more integrated and stable generation frameworks. The OmniNet and GarmentNet modules used in our framework are designed to address these challenges.

2.2. Diffusion-Based VTON Techniques

2.2.1. Early Diffusion-Based VTON Techniques

Denoising diffusion models [1,2] and Stable Diffusion [3] have served as foundational technologies for Virtual Try-On (VTON) systems capable of synthesizing diverse garment images. Ho et al. [22] introduced Classifier-Free Diffusion Guidance (CFG), which first demonstrated that diffusion models can be controlled solely through conditional signals without relying on external classifiers. By jointly training unconditional and conditional models and adjusting guidance strength based on the divergence between their distributions, CFG significantly improved generation quality and has since become a fundamental component of most Stable Diffusion-based models.
Subsequent studies explored effective methods for injecting structural conditions—such as human pose, edges, and depth—into diffusion models. ControlNet, proposed by Zhang et al. [23], introduced an architecture that integrates external control signals by training them in parallel with the diffusion backbone. In the VTON domain, where human pose preservation is critical, ControlNet substantially improved alignment quality and established a new benchmark for conditional control. T2I-Adapter, proposed by Mou et al. [24], facilitates the injection of diverse conditions by inserting lightweight adapter modules into pre-trained diffusion models. Its flexibility in handling conditional information such as garment masks and body structures makes it particularly effective for enhancing conditional control in VTON pipelines.
Furthermore, Ye et al. [25] proposed an attention-level image conditioning approach through IP-Adapter, which injects CLIP image encoder outputs directly into the attention layers of the diffusion U-Net. This mechanism enables the stable transfer of image-based style, pattern, and texture information, and has been widely adopted as a core component in diffusion-based VTON systems for accurately reflecting garment characteristics.

2.2.2. End-to-End Structure Diffusion-Based VTON Techniques

Diffusion-based VTON research has expanded in multiple directions, centering on the latent space of Stable Diffusion. StableVITON, proposed by Kim et al. [4], directly learns semantic correspondence maps between garments and the human body within the latent space, resolving alignment internally during the diffusion process. This approach enables high-resolution synthesis with strong semantic consistency; however, learning correspondence during diffusion leads to slow inference and unstable alignment under complex pose variations.
TryOnDiffusion by Zhu et al. [8] introduced a dual-UNet architecture that separates garment deformation (warping UNet) from final image synthesis (synthesis UNet). While this decoupling reduces artifacts and improves alignment accuracy, it significantly increases memory and computational costs, and alignment errors from the warping stage may still propagate to the final output.
AF-Diffusion (Taming the Power of Diffusion with Appearance Flow) by Gou et al. [9] proposed a hybrid architecture that combines appearance flow with diffusion models to achieve both warping stability and high-quality generation. Although effective in preserving fine details, flow prediction limitations may still introduce artifacts for complex garments.
More recent diffusion-based VTON studies have become increasingly specialized depending on which structural aspects they aim to preserve. LaDI-VTON by Morelli et al. [7] employs textual inversion to learn garment identity embeddings and performs garment replacement while maintaining photorealistic quality within a latent diffusion framework. Despite offering strong controllability via text-based identity adjustment, its T2I embeddings tend to oversmooth complex garment patterns.
IDM-VTON by Choi et al. [5] focuses on preserving person identity using the latent U-Net of Stable Diffusion. By injecting garment semantic features through IP-Adapter, it achieves high structural alignment without relying on explicit occlusion warping. However, its dependence on IP-Adapter-based features makes it less effective for handling intrinsic garment deformations, often resulting in artifacts for complex clothing.
Recent studies have also explored multi-garment configurations and fine-grained structure control. CatVTON, proposed by Chong et al. [6], adopts a concise architecture that eliminates explicit warping and simply concatenates garment and human latents before feeding them into the diffusion U-Net. Category tokens are used to control garment types such as tops, bottoms, and dresses, enabling stable training and efficient synthesis. Nevertheless, this simple latent concatenation makes the model vulnerable to complex clothing structures such as layered outfits.
OOTDiffusion by Xu et al. [11] introduces an outfitting fusion module to directly control garment features in latent space, achieving superior alignment at human–garment boundaries and robust occlusion handling. It also supports multi-component outfits, allowing simultaneous processing of multiple garments, although interference between combined features may occur.
Beyond photographic inputs, alternative conditioning modalities have also been explored. The Picture model by Ning et al. [10] performs try-on using unconstrained garment designs such as sketches or drafts instead of existing garment photos. While it maintains strong global consistency in photorealistic human synthesis, reliance on sketches can sometimes lead to distortions in garment shape.
In summary, diffusion-based VTON research has evolved from general conditional generation frameworks toward structure-aware conditioning, attention-level image integration, and VTON-specific structural control. However, most existing approaches focus on preserving a single aspect—such as structure, garment identity, or person identity—while jointly modeling and integrating these conditions in a functionally disentangled yet unified manner remains limited. These limitations become especially apparent in story visualization scenarios that require diverse garments, multiple styles, and recurring characters.

3. Outline

In this study, we propose StyleVTON (Style Virtual Try-On), a diffusion-based Virtual Try-On model capable of handling diverse visual styles, subjects, and fine-grained garment control. Our model utilizes four input elements: a person image, a garment image, a mask map, and a style prompt. Based on these inputs, it generates a series of images in which the given garment is naturally applied to the target person with various styles. The architecture of our model is built around a diffusion-based U-Net backbone [26] and consists of the following three core modules:
  • StyleNet encodes style characteristics from various visual domains—including photorealistic, animation, webtoon, and watercolor—and injects them into the attention layers of the diffusion network.
  • GarmentNet extracts fine-grained visual features such as garment texture, shape, and color.
  • OmniNet captures the human structure and pose and restores the masked regions.
These three modules are integrated within the cross-attention and self-attention stages to generate visually consistent synthesized images based on the domain–garment–person combination. Figure 1 illustrates the overall pipeline of our framework.

4. Our Method

4.1. StyleNet

StyleNet is a module designed to encode visual domain–level style characteristics, such as color tone, luminance, contour, and texture. In this study, we consider four distinct style domains: photorealistic, webtoon, animation, and watercolor.
StyleNet employs a CLIP-ViT-H/14-based encoder to generate style embeddings from either textual style prompts or reference style images. The resulting style embedding is defined as
$s_i \in \mathbb{R}^{d},$
where $s_i$ denotes the style token corresponding to the style domain $d_i$. The style embedding $s_i$ produced by StyleNet serves as a global style representation that compactly captures domain-specific attributes, including hue, saturation, luminance, and contour characteristics. This embedding is utilized as an external conditioning signal for OmniNet throughout the entire diffusion process.
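The style-token interface can be illustrated with a minimal sketch. Here the CLIP-ViT-H/14 encoder is replaced by a deterministic stand-in, and the prompt strings in `STYLE_PROMPTS` are hypothetical names introduced for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical prompt templates for the four style domains (illustrative only).
STYLE_PROMPTS = {
    "photorealistic": "a photorealistic photo of a person",
    "webtoon": "a webtoon-style drawing of a person",
    "animation": "an animation-style illustration of a person",
    "watercolor": "a watercolor painting of a person",
}

def embed_style(domain: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the CLIP text encoder: maps a style domain to a fixed,
    L2-normalized style token s_i in R^d (deterministic per domain)."""
    idx = sorted(STYLE_PROMPTS).index(domain)
    rng = np.random.default_rng(idx)
    s = rng.standard_normal(dim)
    return s / np.linalg.norm(s)

s_webtoon = embed_style("webtoon")
```

In the actual framework, this embedding would come from the CLIP encoder applied to the style prompt or a reference style image; only the interface (one normalized token per domain) is sketched here.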

4.2. GarmentNet

GarmentNet is a module designed to extract visual appearance information of garments, including texture, color, pattern, and silhouette. In this study, garments are categorized into three types: tops, bottoms, and one-piece garments (e.g., dresses). All categories share an identical encoder architecture, while category-specific normalization parameters are applied to preserve morphological characteristics across different garment types.
Given a garment image $x_g$, the garment region is isolated using a garment segmentation mask $m_g$. The masked garment input is defined as
$x_g^{m} = x_g \odot m_g,$
where $\odot$ denotes element-wise multiplication, and $x_g^{m}$ represents the masked garment image with the background removed.
The masked garment input $x_g^{m}$ is then transformed into a latent garment appearance feature through the GarmentNet encoder $E(\cdot)$:
$f_g = E(x_g^{m}).$
The extracted feature $f_g$ serves as a garment appearance representation that captures the local texture details and structural characteristics of the garment, and is subsequently provided to OmniNet as an external conditioning signal.
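The masking and encoding steps above can be sketched as follows; the encoder is a trivial stand-in for the real GarmentNet (a deep network), so only the masking operation itself reflects the paper's formulation:

```python
import numpy as np

def mask_garment(x_g: np.ndarray, m_g: np.ndarray) -> np.ndarray:
    """x_g^m = x_g (element-wise *) m_g: remove the background pixels."""
    return x_g * m_g[..., None]  # broadcast the binary mask over RGB channels

def encode_garment(x_gm: np.ndarray, dim: int = 16) -> np.ndarray:
    """Stand-in for the GarmentNet encoder E(.); purely illustrative,
    pooling per-channel pixel statistics into a fixed-size feature."""
    stats = np.concatenate([x_gm.mean(axis=(0, 1)), x_gm.std(axis=(0, 1))])
    return np.resize(stats, dim)

rng = np.random.default_rng(0)
x_g = rng.random((64, 64, 3))                       # garment image
m_g = np.zeros((64, 64)); m_g[16:48, 16:48] = 1.0   # segmentation mask
f_g = encode_garment(mask_garment(x_g, m_g))        # garment appearance feature
```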

4.3. OmniNet

OmniNet is a diffusion-based U-Net backbone operating in the latent space, which reconstructs human body structure and pose while integrating conditional information extracted from GarmentNet and StyleNet to generate the final image.

4.3.1. Garment Semantic Conditioning via IP-Adapter

To effectively inject the semantic identity of garments into the attention space of the diffusion model, we introduce an IP-Adapter-based conditioning interface within OmniNet. The IP-Adapter takes the original garment image $x_g$ as input and produces key–value representations that can be directly injected into the attention mechanism:
$(K_i, V_i) = \text{IP-Adapter}(x_g).$
The projection layers of the IP-Adapter are configured to be learnable, enabling the model to learn the semantic alignment between garment features and the human latent space.
Textual, style, and garment semantic conditions are jointly considered in the cross-attention stage of OmniNet according to the following equation:
$\text{Attention}(Q, [K_t, K_s, K_i], [V_t, V_s, V_i]) = \text{softmax}\!\left(\frac{Q\,[K_t, K_s, K_i]^{\top}}{\sqrt{d}}\right)[V_t, V_s, V_i].$
Here,
  • $(K_t, V_t)$ denote the key–value pairs derived from the text prompt embeddings.
  • $(K_s, V_s)$ denote the key–value pairs generated from the style embedding $s_i$ produced by StyleNet.
  • $(K_i, V_i)$ denote the garment semantic features generated via the IP-Adapter.
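The concatenated cross-attention described above can be sketched in a few lines of numpy; the dimensions and random inputs are illustrative, and the real model applies this inside the U-Net's multi-head attention layers:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_source_attention(Q, K_t, K_s, K_i, V_t, V_s, V_i):
    """Single-head attention over the concatenation of text (t), style (s),
    and garment-semantic (i) key-value pairs."""
    K = np.concatenate([K_t, K_s, K_i], axis=0)   # (N_t + N_s + N_i, d)
    V = np.concatenate([V_t, V_s, V_i], axis=0)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))             # each row sums to 1
    return A @ V                                  # convex combination of values

rng = np.random.default_rng(0)
d = 8
Q = rng.standard_normal((10, d))                  # latent (query) tokens
K_t, V_t = rng.standard_normal((4, d)), rng.standard_normal((4, d))
K_s, V_s = rng.standard_normal((1, d)), rng.standard_normal((1, d))
K_i, V_i = rng.standard_normal((6, d)), rng.standard_normal((6, d))
out = multi_source_attention(Q, K_t, K_s, K_i, V_t, V_s, V_i)
```

Because the softmax is taken over the concatenated key axis, the three condition sources compete for attention mass within a single distribution rather than being fused in separate passes.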

4.3.2. Human-Garment Appearance Fusion via Self-Attention

Meanwhile, the garment appearance feature $f_g$ extracted from GarmentNet is fused with the human feature $f_p$ within OmniNet through a self-attention mechanism. Specifically, the joint representation of the two features is processed as
$[f_p', f_g'] = \text{SelfAttention}([f_p, f_g]).$
This self-attention operation selectively transfers garment texture information to spatially compatible human regions, thereby preserving the underlying human structure while stably reflecting fine-grained garment details. The resulting features after self-attention are subsequently forwarded to the decoder for final image synthesis.
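A minimal sketch of this fusion step (single-head, with unprojected queries and keys for brevity; the real module uses the U-Net's learned attention projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_person_garment(f_p, f_g):
    """Joint self-attention over concatenated person and garment tokens;
    the updated person tokens absorb spatially compatible garment texture."""
    f = np.concatenate([f_p, f_g], axis=0)        # (N_p + N_g, d)
    A = softmax(f @ f.T / np.sqrt(f.shape[-1]))   # token-to-token weights
    out = A @ f
    return out[: len(f_p)], out[len(f_p):]        # split back into two streams

rng = np.random.default_rng(1)
f_p = rng.standard_normal((12, 8))   # person latent tokens
f_g = rng.standard_normal((5, 8))    # garment appearance tokens
f_p_new, f_g_new = fuse_person_garment(f_p, f_g)
```

The split after attention preserves the spatial layout of the person stream, which is what allows garment texture to be transferred without disturbing the human structure.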

4.3.3. Style Conditioning via Adaptive Layer Normalization

In addition, OmniNet is designed such that style embeddings directly influence the feature distributions through Adaptive Layer Normalization (AdaLN). The AdaLN operation is defined as
$\text{AdaLN}(x) = \gamma(s_i) \cdot \frac{x - \mu(x)}{\sigma(x)} + \beta(s_i),$
where $\gamma$ and $\beta$ denote the scale and shift parameters predicted from the style embedding $s_i$. This formulation enables domain-specific color tone and luminance characteristics to be consistently reflected throughout the diffusion process.
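The AdaLN operation can be sketched as follows; the linear maps `W_gamma` and `W_beta` that predict the scale and shift from the style embedding are an illustrative parameterization, as the paper does not specify the predictor's architecture:

```python
import numpy as np

def adaln(x, s_i, W_gamma, W_beta, eps=1e-5):
    """AdaLN(x) = gamma(s_i) * (x - mu(x)) / sigma(x) + beta(s_i).
    gamma/beta are predicted here by linear maps from the style embedding
    (an assumption for illustration)."""
    gamma = s_i @ W_gamma                       # (d_feat,)
    beta = s_i @ W_beta
    mu = x.mean(axis=-1, keepdims=True)         # per-feature-row statistics
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
d_style, d_feat = 4, 8
x = rng.standard_normal((3, d_feat))            # feature rows
s_i = rng.standard_normal(d_style)              # style embedding
W_gamma = rng.standard_normal((d_style, d_feat))
W_beta = rng.standard_normal((d_style, d_feat))
y = adaln(x, s_i, W_gamma, W_beta)
```

With an identity style mapping (gamma = 1, beta = 0) this reduces to plain layer normalization, so the style embedding only modulates the normalized feature distribution.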
Through this attention-based multi-source conditioning architecture, OmniNet simultaneously achieves the preservation of human structure, the maintenance of garment appearance, and controlled visual style transformation.

4.4. Loss Function

Our model is trained using diverse human data across four visual domains—photorealistic, webtoon, animation, and watercolor—covering a wide range of ages and genders. Each training sample consists of a tuple $(x_p, m_p, x_g, m_g, s_i)$, and the model is optimized by jointly minimizing a diffusion-based noise prediction loss together with style and garment consistency losses. The overall objective function is defined as
$\mathcal{L} = \mathbb{E}_{x_t,\,\epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t;\, c,\, s_i,\, t) \right\|_2^2\right] + \lambda_1 \mathcal{L}_{style} + \lambda_2 \mathcal{L}_{cloth},$
where $\epsilon_\theta(\cdot)$ denotes the noise prediction network parameterized by $\theta$.
The first term corresponds to the standard denoising score matching loss, which encourages OmniNet to learn a stable denoising process under the given conditions. During this process, textual prompts and style tokens are injected via cross-attention, while garment semantic information is integrated into the same attention space through the IP-Adapter. In contrast, garment appearance information is incorporated through the self-attention pathway within OmniNet.
The style consistency loss $\mathcal{L}_{style}$ is a CLIP-based objective that constrains excessive changes in human structural representation when only the style token is altered for the same human–garment input. The garment fidelity loss $\mathcal{L}_{cloth}$ is designed to ensure that the garment appearance features extracted by GarmentNet are faithfully preserved in the generated results.
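The composition of the overall objective can be sketched as follows; the loss weights `lam1`/`lam2` and the scalar stand-ins for the consistency terms are illustrative, since the paper does not report their values:

```python
import numpy as np

def total_loss(eps, eps_pred, l_style, l_cloth, lam1=0.1, lam2=0.1):
    """L = ||eps - eps_theta(x_t; c, s_i, t)||_2^2 + lam1*L_style + lam2*L_cloth.
    lam1/lam2 are illustrative values, not taken from the paper."""
    l_denoise = np.mean((eps - eps_pred) ** 2)   # denoising score matching term
    return l_denoise + lam1 * l_style + lam2 * l_cloth

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 8))                # ground-truth noise
eps_pred = eps + 0.1 * rng.standard_normal((4, 8))  # network prediction
loss = total_loss(eps, eps_pred, l_style=0.5, l_cloth=0.3)
```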
By simultaneously providing domain tokens (style tokens) and garment-related conditions, the model is able to learn domain transfer (style transfer) and garment replacement (Virtual Try-On) within a single unified training framework.

4.5. Training Strategy for Human and Garment Adaptation

Our model is designed to generate consistent results across diverse human types (e.g., gender, age, and body shape) and garment categories (tops, bottoms, and dresses). To this end, we adopt a dual-branch representation architecture that conceptually separates human representations and garment representations, which are subsequently fused within OmniNet through attention-based mechanisms.
Here, the dual-branch structure does not rely on an explicit, separate human encoder. Instead, it refers to the disentangled handling of (i) human structural features that naturally emerge within the diffusion U-Net encoder of OmniNet, and (ii) garment appearance features explicitly extracted by GarmentNet. The intermediate latent features inside OmniNet implicitly encode human structure and pose information during the denoising process, forming a pose-aware representation that is relatively robust to spatial deformations caused by variations in body shape or age.
Garment representations are extracted via GarmentNet, which learns appearance attributes such as silhouette, length, and wrinkle patterns through category-specific module designs. During training, the encoder parameters of GarmentNet are kept fixed (frozen) to prevent distortion of the original garment texture and color information.
These two representations are fused inside OmniNet through self-attention-based spatially aware fusion, enabling garment textures to be applied to spatially consistent human regions. This design allows stable garment synthesis without introducing distortions to the underlying human structure. Meanwhile, garment semantic features injected through the IP-Adapter and style tokens generated by StyleNet provide global semantic cues and domain-specific characteristics via cross-attention and adaptive normalization mechanisms.
During training, only OmniNet and the projection layers of the IP-Adapter are optimized, while GarmentNet and StyleNet remain in their pre-trained states. With this training strategy, our model can robustly handle diverse combinations of human attributes, garments, and styles within a single unified network, simultaneously achieving human structure preservation, garment appearance fidelity, and controllable style transformation.
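This selective-training scheme can be sketched in PyTorch; the `nn.Linear` stand-ins are placeholders for the actual modules, and the optimizer choice and learning rate are assumptions:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the four components (the real modules are large networks).
omninet = nn.Linear(8, 8)
ip_adapter_proj = nn.Linear(8, 8)   # IP-Adapter projection layers
garmentnet = nn.Linear(8, 8)
stylenet = nn.Linear(8, 8)

# Freeze the pre-trained encoders; only OmniNet and the IP-Adapter
# projection layers receive gradient updates.
for frozen in (garmentnet, stylenet):
    for p in frozen.parameters():
        p.requires_grad_(False)

trainable = [p for m in (omninet, ip_adapter_proj) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Freezing GarmentNet and StyleNet keeps the garment texture and style representations stable, so gradient updates cannot drift the pre-trained appearance and style spaces during VTON training.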

5. Implementation and Results

5.1. Implementation Detail

We implemented our model on a cloud platform equipped with an Intel Xeon Platinum 8480 CPU and an NVIDIA H100 GPU with 80 GB of memory. Among the modules of our model, StyleNet is implemented with a CLIP encoder, since a CLIP embedding loss is employed to guide image transformations. Our training process focuses on OmniNet and the IP-Adapter. OmniNet, which applies the clothing information to the character in a scene, is trained using segmentation maps that correspond to the clothes of the target character. We supply 253 characters with their clothing maps for the training of OmniNet and the IP-Adapter. This training process takes 1.5 h for each character category.

5.2. Results

Our framework targets four styles: photorealistic, animation, webtoon, and watercolor. We sample four characters: a man, a woman, an old person, and a child. For each character, we apply six different sets of clothes. The results are presented in Figure 2, Figure 3, Figure 4 and Figure 5. In the upper three rows of each figure, outfits composed of an upper garment and a lower garment are applied to the characters, with the same outfits applied to all four characters. In the lower three rows, we apply different outfits to male and female characters: the man and the old person are male characters, while the woman and the child are female characters. The characters exhibit various poses, demonstrating that our framework applies to characters in diverse poses.

6. Evaluation

6.1. Comparison

We select four existing studies for comparison with our results: CatVTON [6], StableVITON [4], OOTDiffusion [11], and VITON-HD [19]. Among these, VITON-HD is applied only to upper garments. We run the comparison over four characters and six outfits, producing twenty-four results; the same twenty-four results are also generated by each of the four compared studies. We choose four human figures (man, woman, old person, and child) as the most typical types of human figures for our comparison. Sampled results are presented in Figure 6, Figure 7, Figure 8 and Figure 9.

6.2. Quantitative Evaluation

For the evaluation of our results, we estimate the following four metrics for our method and the four compared studies: CLIP score for person, CLIP score for cloth, Fréchet Inception Distance (FID), and Kernel Inception Distance (KID). The CLIP score measures how well a generated image aligns with the provided textual conditions. In our study, the quality of the generated images is evaluated by computing CLIP scores using captions assigned to the generated images. In particular, we separately measure the CLIP score for characters (CLIP score for person) and for clothing (CLIP score for cloth), enabling an independent assessment of how accurately the characters are generated and how precisely the clothing is rendered.
To complement the CLIP score, we additionally estimate FID and KID. FID evaluates the visual quality and diversity of the generated images, while KID complements FID by measuring distributional differences with a maximum mean discrepancy (MMD)-based kernel, without assuming a Gaussian distribution.
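For reference, the standard KID estimator (unbiased MMD² with the cubic polynomial kernel over Inception features) can be sketched as follows; the random feature matrices stand in for real Inception activations:

```python
import numpy as np

def poly_kernel(X, Y, degree=3):
    """Cubic polynomial kernel k(x, y) = (x.y/d + 1)^3, as used by KID."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** degree

def kid(feat_real, feat_fake):
    """Unbiased MMD^2 estimate; unlike FID, no Gaussian assumption is made."""
    Kxx = poly_kernel(feat_real, feat_real)
    Kyy = poly_kernel(feat_fake, feat_fake)
    Kxy = poly_kernel(feat_real, feat_fake)
    m, n = len(feat_real), len(feat_fake)
    sum_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # drop diagonal terms
    sum_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
feat_real = rng.standard_normal((100, 16))
feat_fake_close = rng.standard_normal((100, 16))        # same distribution
feat_fake_far = rng.standard_normal((100, 16)) + 3.0    # shifted distribution
```

Features drawn from the same distribution yield a KID near zero, while a shifted distribution produces a clearly larger value, which is what makes KID informative for small evaluation sets.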
The results of the estimation are presented in Table 1 and Table 2. Table 1 reports the metrics for images in the photorealistic and animation styles, and Table 2 reports those for the webtoon and watercolor styles. Our results record 15 best scores among the 16 metrics for the photorealistic style, 12 of 16 for the animation style, 9 of 16 for the webtoon style, and 13 of 16 for the watercolor style. From these tables, we confirm that our model outperforms the four compared models.
Among the four evaluated styles, the photorealistic style achieves the highest overall performance, while the webtoon style exhibits a relatively smaller number of metrics in which our method outperforms the compared approaches (9 out of 16). This behavior can be attributed to the inherent characteristics of the webtoon style, which tends to simplify complex visual patterns for stylized depiction. As a result, the strengths of our model—particularly its ability to faithfully preserve fine-grained and intricate garment patterns—are less prominently reflected under the webtoon-style setting.

6.3. Cross-Dataset Generalization Discussion

The quantitative evaluation of our study is primarily conducted on the DressCode dataset, which contains diverse garment categories, including tops, bottoms, and dresses, and is well suited for analyzing fine-grained garment control and style adaptation performance. Nevertheless, evaluations based on a single dataset may be insufficient to fully demonstrate the generalization capability of our model.
For instance, other Virtual Try-On datasets such as VITON-HD exhibit distinct characteristics in terms of pose distributions, garment appearances, and image resolutions. These differences are particularly important when considering real-world deployment scenarios.
Our method adopts a disentangled processing strategy that separately handles human structure, garment appearance, and style information. Owing to this design, our model is expected to exhibit relatively stable behavior even under distribution shifts across datasets. In future work, we plan to further validate this property through additional experiments, including cross-dataset evaluations, to more clearly assess the generalization performance of our approach.

6.4. Ablation Study

To more accurately evaluate the effectiveness of our approach, we conduct an ablation study on two key components of our model: the IP-Adapter and GarmentNet. As illustrated in Figure 10a, the IP-Adapter plays a crucial role in preserving the accurate shape, color, and pattern of the target garment. When the IP-Adapter is removed, the generated results suffer from noticeable mismatches in garment shape and color compared to the input garment.
Furthermore, as shown in Figure 10b, GarmentNet is responsible for delivering garment appearance features to the diffusion backbone. In the absence of GarmentNet, the desired garment specified by the user is not properly reflected in the generated output, and severe visual artifacts are observed.
For the results presented in Figure 10, we additionally perform a quantitative evaluation using the metrics adopted in the previous section. The measured CLIP score, FID, and KID are presented in Table 3, which quantitatively supports the qualitative observations from the ablation analysis.

7. Conclusions and Future Work

In this study, we proposed a generative Virtual Try-On framework capable of consistently handling the diverse characters, garments, and visual styles required in story visualization scenarios. Our approach is built upon an integrated architecture composed of OmniNet, which captures human structure and pose; GarmentNet, which explicitly extracts garment appearance features; and StyleNet, which controls domain-specific visual styles. By modeling human, garment, and style information as functionally disentangled conditioning signals and integrating them within the self-attention and cross-attention mechanisms of a diffusion model, the proposed framework simultaneously achieves human structure preservation, garment appearance fidelity, and controllable style transformation. Extensive experiments across multiple style domains and garment categories demonstrate that our method delivers superior quantitative performance and stable visual quality compared to existing Virtual Try-On approaches.
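The disentangled conditioning described above can be illustrated with a toy cross-attention step in which image latents attend over concatenated human, garment, and style tokens. This is an illustrative sketch, not the authors' implementation: all dimensions, token counts, and projection matrices below are arbitrary stand-ins.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, cond_tokens, wq, wk, wv):
    """Single-head cross-attention: image latents attend over condition tokens."""
    q = latents @ wq          # (n_latent, d_head) queries from image latents
    k = cond_tokens @ wk      # (n_cond, d_head)  keys from conditioning tokens
    v = cond_tokens @ wv      # (n_cond, d_head)  values from conditioning tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v           # (n_latent, d_head) condition-injected features

rng = np.random.default_rng(0)
d_model, d_head = 64, 32
latents = rng.standard_normal((16, d_model))   # image latents (e.g., 4x4 patches)
human = rng.standard_normal((4, d_model))      # stand-in human/pose tokens
garment = rng.standard_normal((6, d_model))    # stand-in garment appearance tokens
style = rng.standard_normal((2, d_model))      # stand-in style-prompt tokens
cond = np.concatenate([human, garment, style]) # disentangled signals, one sequence
wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = cross_attention(latents, cond, wq, wk, wv)
print(out.shape)  # (16, 32)
```

Because the three conditioning sources remain separate token groups until the attention step, each can be ablated or swapped independently, which is the property the ablation study in Section 6.4 probes.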
In future work, we plan to further investigate the generalization capability of the proposed framework through cross-dataset evaluations and additional experiments using more complex and realistic datasets. We also aim to extend our model to better handle special-purpose garments and multi-layered clothing structures commonly encountered in games, fantasy films, and immersive virtual environments. Furthermore, by enhancing continuous style control and incorporating user-interactive conditioning mechanisms, we seek to evolve our framework into a more versatile Virtual Try-On solution applicable to a wide range of generative content creation pipelines beyond story visualization.

Author Contributions

Conceptualization, K.M.; Methodology, W.C.; Software, W.C.; Validation, H.Y.; Resources, W.C.; Data curation, W.C.; Writing—original draft, K.M.; Writing—review & editing, H.Y.; Supervision, H.Y. and K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because they are part of an ongoing study. Requests to access the datasets should be directed to yanghk@smu.ac.kr. The copyright of the photographs used in our manuscript falls under “Fair Use”.

Acknowledgments

This research was supported by Sangmyung University in 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NeurIPS 2020, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  2. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the ICLR 2021, Vienna, Austria, 4 May 2021; pp. 1–25. [Google Scholar]
  3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  4. Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 8176–8185. [Google Scholar]
  5. Choi, Y.; Park, S.; Lee, S.; Kwak, J.; Choo, J. IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-On in the Wild. In Proceedings of the ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 206–235. [Google Scholar]
  6. Chong, Z.; Dong, X.; Li, H.; Zhang, S.; Zhang, W.; Zhang, X.; Zhao, H.; Jiang, D.; Liang, X. CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. In Proceedings of the ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar]
  7. Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM Multimedia 2023, Vancouver, BC, Canada, 7–10 June 2023; pp. 8580–8589. [Google Scholar]
  8. Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 4606–4615. [Google Scholar]
  9. Gou, J.; Sun, S.; Zhang, J.; Si, J.; Qian, C.; Zhang, L. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the ACM Multimedia 2023, Vancouver, BC, Canada, 7–10 June 2023; pp. 7599–7607. [Google Scholar]
  10. Ning, S.; Wang, D.; Qin, Y.; Jin, Z.; Wang, B.; Han, X. Picture: Photorealistic Virtual Try-On from Unconstrained Designs. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 6976–6985. [Google Scholar]
  11. Xu, Y.; Gu, T.; Chen, W.; Chen, C. OOTDiffusion: Outfitting Fusion-based Latent Diffusion for Controllable Virtual Try-On. In Proceedings of the AAAI 2025, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8996–9004. [Google Scholar]
  12. Han, X.; Wu, Z.; Wu, Z.; Yu, R.; Davis, L. VITON: An Image-based Virtual Try-on Network. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7543–7552. [Google Scholar]
  13. Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; Yang, M. Toward Characteristic-Preserving Image-based Virtual Try-On Network. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 589–604. [Google Scholar]
  14. Ge, C.; Song, Y.; Ge, Y.; Yang, H.; Liu, W.; Luo, P. Disentangled Cycle Consistency for Highly-Realistic Virtual Try-On. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 16928–16937. [Google Scholar]
  15. Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; Luo, P. Parser-Free Virtual Try-On via Distilling Appearance Flows. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 8485–8493. [Google Scholar]
  16. Issenhuth, T.; Mary, J.; Calauzenes, C. Do Not Mask What You Do Not Need to Mask: A Parser-Free Virtual Try-On. In Proceedings of the ECCV 2020, Glasgow, Scotland, 23 August 2020; pp. 619–635. [Google Scholar]
  17. Cui, A.; Mahajan, J.; Shah, V.; Gomathinayagam, P.; Lazebnik, S. Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 8235–8239. [Google Scholar]
  18. Xie, Z.; Huang, Z.; Dong, X.; Zhao, F.; Dong, H.; Zhang, X.; Zhu, F.; Liang, X. GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 23550–23559. [Google Scholar]
  19. Choi, S.; Park, S.; Lee, M.; Choo, J. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 14131–14140. [Google Scholar]
  20. Lee, S.; Gu, G.; Park, S.; Choi, S.; Choo, J. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 204–219. [Google Scholar]
  21. Morelli, D.; Fincato, M.; Cornia, M.; Landi, F.; Cesari, F.; Cucchiara, R. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 2231–2235. [Google Scholar]
  22. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 1–12. [Google Scholar]
  23. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the ICCV 2023, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  24. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y.; Qie, X. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 1–10. [Google Scholar]
  25. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text-Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. In Proceedings of the NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; pp. 1–12. [Google Scholar]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Figure 1. The overview of our framework: The input to our system is composed of a style prompt, which is processed in StyleNet; a garment image ( x g ) and its garment mask ( m g ), which are processed in GarmentNet; and a person image ( x p ) and its mask map ( m p ), which are processed in OmniNet. In OmniNet, the self-attention module processes low-level features such as texture pattern and the cross-attention module processes high-level features such as body, garment, and style.
Figure 2. Result images (1): photorealistic style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 3. Result images (2): Animation style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 4. Result images (3): webtoon style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 5. Result images (4): Watercolor style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 6. Comparison (1): photorealistic style: man and child are compared.
Figure 7. Comparison (2): Animation style: woman and old person are compared.
Figure 8. Comparison (3): Webtoon style: man and old person are compared.
Figure 9. Comparison (4): Watercolor style: woman and child are compared.
Figure 10. Ablation study for visual contents.
Table 1. Comparison on photorealistic style and animation style. ↑ denotes that higher values are better and ↓ denotes that lower values are better. Bold text indicates the best results.
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| photorealistic | male | CLIP (for person) ↑ | 0.948 | 0.945 | 0.866 | 0.795 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.546 | 0.479 | 0.510 | 0.526 | 0.483 |
| | | FID ↓ | 113.530 | 136.284 | 221.039 | 218.031 | 132.419 |
| | | KID ↓ | 0.019 | 0.023 | 0.020 | 0.036 | 0.037 |
| | female | CLIP (for person) ↑ | 0.905 | 0.903 | 0.827 | 0.821 | 0.895 |
| | | CLIP (for cloth) ↓ | 0.646 | 0.506 | 0.555 | 0.611 | 0.523 |
| | | FID ↓ | 159.585 | 167.863 | 253.367 | 242.290 | 171.420 |
| | | KID ↓ | 0.010 | 0.021 | 0.018 | 0.021 | 0.041 |
| | old person | CLIP (for person) ↑ | 0.927 | 0.943 | 0.818 | 0.812 | 0.907 |
| | | CLIP (for cloth) ↓ | 0.555 | 0.447 | 0.500 | 0.545 | 0.463 |
| | | FID ↓ | 236.439 | 114.369 | 265.059 | 242.273 | 158.976 |
| | | KID ↓ | 0.009 | 0.091 | 0.013 | 0.026 | 0.073 |
| | child | CLIP (for person) ↑ | 0.930 | 0.924 | 0.867 | 0.832 | 0.892 |
| | | CLIP (for cloth) ↓ | 0.593 | 0.496 | 0.536 | 0.565 | 0.524 |
| | | FID ↓ | 162.779 | 171.584 | 218.992 | 223.761 | 174.003 |
| | | KID ↓ | 0.021 | 0.025 | 0.061 | 0.058 | 0.022 |
| animation | male | CLIP (for person) ↑ | 0.953 | 0.947 | 0.825 | 0.872 | 0.948 |
| | | CLIP (for cloth) ↓ | 0.579 | 0.478 | 0.541 | 0.543 | 0.490 |
| | | FID ↓ | 103.875 | 154.046 | 193.646 | 206.615 | 176.143 |
| | | KID ↓ | 0.005 | 0.038 | 0.011 | 0.014 | 0.050 |
| | female | CLIP (for person) ↑ | 0.942 | 0.930 | 0.861 | 0.841 | 0.911 |
| | | CLIP (for cloth) ↓ | 0.557 | 0.468 | 0.511 | 0.542 | 0.484 |
| | | FID ↓ | 155.628 | 157.243 | 201.373 | 206.168 | 201.903 |
| | | KID ↓ | 0.013 | 0.022 | 0.014 | 0.017 | 0.007 |
| | old person | CLIP (for person) ↑ | 0.940 | 0.949 | 0.799 | 0.816 | 0.941 |
| | | CLIP (for cloth) ↓ | 0.577 | 0.439 | 0.517 | 0.526 | 0.472 |
| | | FID ↓ | 100.282 | 103.348 | 234.856 | 239.627 | 143.891 |
| | | KID ↓ | 0.005 | 0.062 | 0.001 | 0.011 | 0.058 |
| | child | CLIP (for person) ↑ | 0.949 | 0.948 | 0.869 | 0.894 | 0.943 |
| | | CLIP (for cloth) ↓ | 0.567 | 0.458 | 0.520 | 0.490 | 0.471 |
| | | FID ↓ | 186.895 | 104.953 | 271.145 | 197.011 | 156.727 |
| | | KID ↓ | 0.014 | 0.044 | 0.038 | 0.016 | 0.036 |
Table 2. Comparison on webtoon style and watercolor styles. ↑ denotes that higher values are better and ↓ denotes that lower values are better. Bold text indicates the best results.
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| webtoon | male | CLIP (for person) ↑ | 0.938 | 0.914 | 0.854 | 0.820 | 0.940 |
| | | CLIP (for cloth) ↓ | 0.541 | 0.481 | 0.518 | 0.543 | 0.486 |
| | | FID ↓ | 109.101 | 175.445 | 180.319 | 227.459 | 159.785 |
| | | KID ↓ | 0.025 | 0.026 | 0.003 | 0.032 | 0.022 |
| | female | CLIP (for person) ↑ | 0.947 | 0.926 | 0.919 | 0.844 | 0.923 |
| | | CLIP (for cloth) ↓ | 0.560 | 0.474 | 0.507 | 0.552 | 0.492 |
| | | FID ↓ | 154.373 | 168.693 | 175.951 | 205.340 | 189.056 |
| | | KID ↓ | 0.005 | 0.008 | 0.007 | 0.011 | 0.020 |
| | old person | CLIP (for person) ↑ | 0.936 | 0.945 | 0.961 | 0.817 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.551 | 0.451 | 0.517 | 0.520 | 0.485 |
| | | FID ↓ | 136.612 | 144.740 | 204.457 | 190.029 | 150.324 |
| | | KID ↓ | 0.028 | 0.046 | 0.005 | 0.014 | 0.045 |
| | child | CLIP (for person) ↑ | 0.942 | 0.949 | 0.836 | 0.883 | 0.942 |
| | | CLIP (for cloth) ↓ | 0.582 | 0.453 | 0.517 | 0.491 | 0.466 |
| | | FID ↓ | 111.999 | 119.561 | 216.960 | 213.377 | 159.754 |
| | | KID ↓ | 0.024 | 0.032 | 0.017 | 0.007 | 0.014 |
| watercolor | male | CLIP (for person) ↑ | 0.933 | 0.899 | 0.900 | 0.829 | 0.936 |
| | | CLIP (for cloth) ↓ | 0.624 | 0.504 | 0.522 | 0.580 | 0.510 |
| | | FID ↓ | 109.170 | 165.194 | 183.500 | 237.846 | 115.074 |
| | | KID ↓ | 0.015 | 0.026 | 0.002 | 0.007 | 0.064 |
| | female | CLIP (for person) ↑ | 0.927 | 0.888 | 0.900 | 0.857 | 0.917 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.489 | 0.525 | 0.552 | 0.520 |
| | | FID ↓ | 165.093 | 188.962 | 192.864 | 222.935 | 179.176 |
| | | KID ↓ | 0.001 | 0.035 | 0.009 | 0.008 | 0.014 |
| | old person | CLIP (for person) ↑ | 0.912 | 0.933 | 0.821 | 0.843 | 0.927 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.406 | 0.493 | 0.503 | 0.446 |
| | | FID ↓ | 208.356 | 144.734 | 251.704 | 217.168 | 144.041 |
| | | KID ↓ | 0.005 | 0.068 | 0.009 | 0.028 | 0.059 |
| | child | CLIP (for person) ↑ | 0.951 | 0.941 | 0.887 | 0.848 | 0.933 |
| | | CLIP (for cloth) ↓ | 0.570 | 0.497 | 0.541 | 0.558 | 0.505 |
| | | FID ↓ | 116.212 | 148.449 | 181.683 | 209.659 | 120.766 |
| | | KID ↓ | 0.001 | 0.029 | 0.027 | 0.050 | 0.003 |
Table 3. Quantitative results of the ablation study. ↑ denotes that higher values are better and ↓ denotes that lower values are better.
| Component | Garment | Metric | With | Without |
|---|---|---|---|---|
| (a) IP-Adapter | upper | CLIP (for cloth) ↑ | 0.924 | 0.432 |
| | | FID ↓ | 100.454 | 120.545 |
| | | KID ↓ | 0.024 | 0.035 |
| | lower | CLIP (for cloth) ↑ | 0.952 | 0.632 |
| | | FID ↓ | 164.023 | 188.152 |
| | | KID ↓ | 0.002 | 0.012 |
| (b) GarmentNet | upper | CLIP (for cloth) ↑ | 0.911 | 0.824 |
| | | FID ↓ | 135.312 | 144.251 |
| | | KID ↓ | 0.025 | 0.046 |
| | lower | CLIP (for cloth) ↑ | 0.852 | 0.615 |
| | | FID ↓ | 140.125 | 200.254 |
| | | KID ↓ | 0.012 | 0.023 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Choi, W.; Yang, H.; Min, K. A Style-Adapted Virtual Try-On Technique for Story Visualization. Electronics 2026, 15, 514. https://doi.org/10.3390/electronics15030514