Article

A Style-Adapted Virtual Try-On Technique for Story Visualization

1
Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
2
Department of Software, Sangmyung University, Cheonan 31066, Republic of Korea
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2026, 15(3), 514; https://doi.org/10.3390/electronics15030514
Submission received: 14 December 2025 / Revised: 14 January 2026 / Accepted: 23 January 2026 / Published: 25 January 2026
(This article belongs to the Special Issue Application of Machine Learning in Graphics and Images, 2nd Edition)

Abstract

We propose a novel clothing application technique designed for story visualization frameworks, in which various characters appear wearing a wide range of outfits. To achieve this goal, we extend a Virtual Try-On framework for synthetic garment fitting. Conventional Virtual Try-On methods are limited to generating images of a single person wearing a restricted set of clothes within a fixed style domain. To overcome these limitations, we apply an improved Virtual Try-On model trained on appropriately processed datasets, enabling the generation of upper and lower garments separately across diverse characters and producing images in four distinct styles: photorealistic, webtoon, animation, and watercolor. Our system collects character and clothing images, performs accurate masking of garment regions, and takes a style-specific text prompt as input. Based on these inputs, garment-specific conditioning is applied to synthesize the clothing, followed by a cross-style diffusion process that generates Virtual Try-On images reflecting multiple visual styles. Our approach significantly enhances the adaptability and stylistic diversity of Virtual Try-On technology for story visualization applications.

1. Introduction

Recent advances in generative artificial intelligence have transformed character creation and visual representation across a wide range of visual content production domains, including webtoons, animation, games, and video media. In particular, diffusion-based image generation models [1,2,3] have emerged as powerful tools capable of producing high-quality visual results. Building on these models, Virtual Try-On (VTON) techniques have been actively studied for garment synthesis and human image generation. Despite this technological progress, however, existing VTON frameworks remain limited in their ability to adequately support content production scenarios such as story visualization, in which the same character repeatedly appears wearing diverse outfits and exhibiting multiple visual styles.
Most existing VTON studies [4,5,6,7,8,9,10,11] have been developed primarily within the photorealistic domain, focusing on synthesizing garments for a single person under a fixed visual style. While these approaches have significantly improved garment alignment and visual realism, their extension to non-photorealistic style domains, such as animation, webtoon, and watercolor, has been relatively limited. Moreover, many existing frameworks fail to provide sufficient fine-grained garment controllability, such as independent control of upper and lower garments or partial garment replacement. In addition, style transformation, human structure preservation, and garment alignment are often handled in separate stages or independent modules, leading to fragmented pipelines. These structural limitations reduce visual consistency and practical usability in story-based visualization environments where diverse characters and styles coexist.
Virtual Try-On for story visualization goes beyond a simple image synthesis problem and simultaneously requires preservation of human structural identity, retention of garment-specific texture and shape, and consistent visual transformation across different style domains. However, in prior studies, style information is typically expressed at the text-prompt level or applied as a post-processing step, while garment information is often handled through warping operations or latent concatenation. As a result, methodological approaches that systematically disentangle and integrate person, garment, and style information as independent yet interacting conditioning signals within the attention structure of diffusion models have not been sufficiently explored. This gap highlights a critical technical limitation in supporting stable and controllable combinations of diverse visual styles and garments.
To address these limitations, our work reformulates the Virtual Try-On problem as a multi-domain conditional diffusion generation problem for story visualization. We propose StyleVTON, a unified framework composed of three specialized modules: OmniNet, which captures human structure and pose; GarmentNet, which extracts garment texture and shape; and StyleNet, which controls visual style domains. Our framework separates person, garment, and style information into independent conditioning signals and integrates them into the self-attention and cross-attention layers of a diffusion model. This design enables the stable generation of diverse character–garment–style combinations within a single network. Unlike conventional step-wise pipelines or simple conditional fusion strategies, our architecture is explicitly designed to ensure the visual consistency and controllability required in story visualization scenarios.
Our study aims to answer the following key research question: Can a single diffusion-based Virtual Try-On framework preserve human structural identity and garment-specific visual characteristics while consistently synthesizing garments across both photorealistic and diverse non-photorealistic visual styles (e.g., animation, webtoon, and watercolor)? Furthermore, we experimentally investigate whether modeling person, garment, and style information as functionally disentangled conditioning signals and integrating them within the attention structure of diffusion models can effectively enhance visual consistency and controllability compared to existing VTON methods in story visualization scenarios.
The main contributions of this paper can be summarized as follows. First, we formalize the Virtual Try-On problem for story visualization as a conditional generation task that simultaneously requires style invariance and garment sensitivity. Second, we propose a novel diffusion-based framework that integrates person, garment, and style information as functionally disentangled conditioning signals within the attention mechanism, offering methodological insights beyond simple module composition. Third, by combining style-token-based conditioning with garment-specific attention, we experimentally demonstrate that our approach maintains structural consistency and garment fidelity not only in photorealistic domains but also across diverse artistic styles such as animation, webtoon, and watercolor. Through these contributions, this work presents a scalable and practical diffusion-based Virtual Try-On framework tailored for story-driven visual content production.
This paper is organized as follows: Section 2 reviews related work on Virtual Try-On and style transfer techniques, highlighting the limitations of current approaches. Section 3 outlines our method. Section 4 details our style-adapted VTON framework, including the formal definition of the dual-path architecture and the specific training strategy. Section 5 presents the experimental setup and dataset construction, and a comprehensive analysis of the quantitative and qualitative results compared to state-of-the-art methods is presented in Section 6. Finally, Section 7 concludes the paper with a summary of our findings and discussions of future research directions.

2. Related Work

2.1. Early VTON Techniques for Garment Alignment

Early Virtual Try-On (VTON) studies primarily focused on the problem of garment–human alignment as a core challenge, and proposed image warping-based or GAN-based approaches to address it. This line of research played a crucial role in formally defining the VTON problem and laid the foundation for a wide range of subsequent extensions.
VITON [12], one of the representative early works, employed Thin-Plate Spline (TPS) transformations to deform garments according to the target human silhouette, followed by a GAN-based synthesis network to generate the final image. CP-VTON [13] later introduced a GMM-based warping module to better preserve garment patterns, while Distilling Appearance Flows [14] reduced dependency on human parsing by learning appearance flow in a teacher–student manner, aiming for more stable alignment. Although these warping- or flow-based methods improved garment alignment accuracy, they still suffered from misalignment issues under complex pose variations, garment wrinkles, and occlusions.
Ge et al. [15] presented a follow-up study that simultaneously pursued human structure preservation and visual realism by leveraging disentangled representations and cycle consistency. Parser-Free VTON [16] alleviated error propagation caused by inaccurate human parsing, while Street TryOn [17] targeted generalization in in-the-wild environments by accommodating diverse backgrounds and poses. Despite extending the applicability of VTON systems, these methods still relied on multi-stage pipelines, which inevitably led to accumulated errors and training instability.
To further improve generalization, GP-VTON [18] proposed a framework that combines global parsing with local garment alignment, enabling relatively stable synthesis under complex pose variations and diverse human conditions. While GP-VTON expanded the applicability of warping-based approaches, it still depended on external alignment modules and multi-stage pipelines, and therefore could not fundamentally resolve error accumulation and identity drift.
High-resolution synthesis-oriented approaches such as VITON-HD [19], as well as High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions [20], extended conventional warping/GAN-based frameworks to high-resolution settings. Although these methods improved visual quality through normalization techniques and conditional processing, their reliance on external alignment modules remained a structural limitation. Similarly, Dress Code [21] expanded fine-grained garment control by supporting multiple garment categories, yet could not fully guarantee alignment stability for complex outfit combinations.
Overall, early VTON techniques handled garment alignment using image warping or GAN-based modules, which introduced inherent limitations such as accumulated alignment errors, identity drift, and increased pipeline complexity. These issues become particularly pronounced in story visualization scenarios, where diverse garments and poses are required and the same character repeatedly appears. Consequently, there is a growing demand for more integrated and stable generation frameworks. The OmniNet and GarmentNet modules used in our framework are designed to address these challenges.

2.2. Diffusion-Based VTON Techniques

2.2.1. Early Diffusion-Based VTON Techniques

Denoising diffusion models [1,2] and Stable Diffusion [3] have served as foundational technologies for Virtual Try-On (VTON) systems capable of synthesizing diverse garment images. Ho et al. [22] introduced Classifier-Free Diffusion Guidance (CFG), which first demonstrated that diffusion models can be controlled solely through conditional signals without relying on external classifiers. By jointly training unconditional and conditional models and adjusting guidance strength based on the divergence between their distributions, CFG significantly improved generation quality and has since become a fundamental component of most Stable Diffusion-based models.
Subsequent studies explored effective methods for injecting structural conditions—such as human pose, edges, and depth—into diffusion models. ControlNet, proposed by Zhang et al. [23], introduced an architecture that integrates external control signals by training them in parallel with the diffusion backbone. In the VTON domain, where human pose preservation is critical, ControlNet substantially improved alignment quality and established a new benchmark for conditional control. T2I-Adapter, proposed by Mou et al. [24], facilitates the injection of diverse conditions by inserting lightweight adapter modules into pre-trained diffusion models. Its flexibility in handling conditional information such as garment masks and body structures makes it particularly effective for enhancing conditional control in VTON pipelines.
Furthermore, Ye et al. [25] proposed an attention-level image conditioning approach through IP-Adapter, which injects CLIP image encoder outputs directly into the attention layers of the diffusion U-Net. This mechanism enables the stable transfer of image-based style, pattern, and texture information, and has been widely adopted as a core component in diffusion-based VTON systems for accurately reflecting garment characteristics.

2.2.2. End-to-End Structure Diffusion-Based VTON Techniques

Diffusion-based VTON research has expanded in multiple directions, centering on the latent space of Stable Diffusion. StableVITON, proposed by Kim et al. [4], directly learns semantic correspondence maps between garments and the human body within the latent space, resolving alignment internally during the diffusion process. This approach enables high-resolution synthesis with strong semantic consistency; however, learning correspondence during diffusion leads to slow inference and unstable alignment under complex pose variations.
TryOnDiffusion by Zhu et al. [8] introduced a dual-UNet architecture that separates garment deformation (warping UNet) from final image synthesis (synthesis UNet). While this decoupling reduces artifacts and improves alignment accuracy, it significantly increases memory and computational costs, and alignment errors from the warping stage may still propagate to the final output.
AF-Diffusion (Taming the Power of Diffusion with Appearance Flow) by Gou et al. [9] proposed a hybrid architecture that combines appearance flow with diffusion models to achieve both warping stability and high-quality generation. Although effective in preserving fine details, flow prediction limitations may still introduce artifacts for complex garments.
More recent diffusion-based VTON studies have become increasingly specialized depending on which structural aspects they aim to preserve. LaDI-VTON by Morelli et al. [7] employs textual inversion to learn garment identity embeddings and performs garment replacement while maintaining photorealistic quality within a latent diffusion framework. Despite offering strong controllability via text-based identity adjustment, its T2I embeddings tend to oversmooth complex garment patterns.
IDM-VTON by Choi et al. [5] focuses on preserving person identity using the latent U-Net of Stable Diffusion. By injecting garment semantic features through IP-Adapter, it achieves high structural alignment without relying on explicit occlusion warping. However, its dependence on IP-Adapter-based features makes it less effective for handling intrinsic garment deformations, often resulting in artifacts for complex clothing.
Recent studies have also explored multi-garment configurations and fine-grained structure control. CatVTON, proposed by Chong et al. [6], adopts a concise architecture that eliminates explicit warping and simply concatenates garment and human latents before feeding them into the diffusion U-Net. Category tokens are used to control garment types such as tops, bottoms, and dresses, enabling stable training and efficient synthesis. Nevertheless, this simple latent concatenation makes the model vulnerable to complex clothing structures such as layered outfits.
OOTDiffusion by Xu et al. [11] introduces an outfitting fusion module to directly control garment features in latent space, achieving superior alignment at human–garment boundaries and robust occlusion handling. It also supports multi-component outfits, allowing simultaneous processing of multiple garments, although interference between combined features may occur.
Beyond photographic inputs, alternative conditioning modalities have also been explored. The Picture model by Ning et al. [10] performs try-on using unconstrained garment designs such as sketches or drafts instead of existing garment photos. While it maintains strong global consistency in photorealistic human synthesis, reliance on sketches can sometimes lead to distortions in garment shape.
In summary, diffusion-based VTON research has evolved from general conditional generation frameworks toward structure-aware conditioning, attention-level image integration, and VTON-specific structural control. However, most existing approaches focus on preserving a single aspect—such as structure, garment identity, or person identity—while jointly modeling and integrating these conditions in a functionally disentangled yet unified manner remains limited. These limitations become especially apparent in story visualization scenarios that require diverse garments, multiple styles, and recurring characters.

3. Outline

In this study, we propose StyleVTON (Style Virtual Try-On), a diffusion-based Virtual Try-On model capable of handling diverse visual styles, subjects, and fine-grained garment control. Our model utilizes four input elements: a person image, a garment image, a mask map, and a style prompt. Based on these inputs, it generates a series of images in which the given garment is naturally applied to the target person with various styles. The architecture of our model is built around a diffusion-based U-Net backbone [26] and consists of the following three core modules:
  • StyleNet encodes style characteristics from various visual domains—including photorealistic, animation, webtoon, and watercolor—and injects them into the attention layers of the diffusion network.
  • GarmentNet extracts fine-grained visual features such as garment texture, shape, and color.
  • OmniNet captures the human structure and pose and restores the masked regions.
These three modules are integrated within the cross-attention and self-attention stages to generate visually consistent synthesized images based on the domain–garment–person combination. Figure 1 illustrates the overall pipeline of our framework.

4. Our Method

4.1. StyleNet

StyleNet is a module designed to encode visual domain–level style characteristics, such as color tone, luminance, contour, and texture. In this study, we consider four distinct style domains: photorealistic, webtoon, animation, and watercolor.
StyleNet employs a CLIP-ViT-H/14-based encoder to generate style embeddings from either textual style prompts or reference style images. The resulting style embedding is defined as
$s_i \in \mathbb{R}^{d},$
where $s_i$ denotes the style token corresponding to the style domain $d_i$. The style embedding $s_i$ produced by StyleNet serves as a global style representation that compactly captures domain-specific attributes, including hue, saturation, luminance, and contour characteristics. This embedding is utilized as an external conditioning signal for OmniNet throughout the entire diffusion process.
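The style-token interface can be illustrated with a minimal sketch. Here the CLIP-ViT-H/14 encoder is replaced by a deterministic stand-in, and the prompt strings in `STYLE_PROMPTS` are hypothetical names introduced for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical prompt templates for the four style domains (illustrative only).
STYLE_PROMPTS = {
    "photorealistic": "a photorealistic photo of a person",
    "webtoon": "a webtoon-style drawing of a person",
    "animation": "an animation-style illustration of a person",
    "watercolor": "a watercolor painting of a person",
}

def embed_style(domain: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the CLIP text encoder: maps a style domain to a fixed,
    L2-normalized style token s_i in R^d (deterministic per domain)."""
    idx = sorted(STYLE_PROMPTS).index(domain)
    rng = np.random.default_rng(idx)
    s = rng.standard_normal(dim)
    return s / np.linalg.norm(s)

s_webtoon = embed_style("webtoon")
```

In the actual framework, this embedding would come from the CLIP encoder applied to the style prompt or a reference style image; only the interface (one normalized token per domain) is sketched here.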

4.2. GarmentNet

GarmentNet is a module designed to extract visual appearance information of garments, including texture, color, pattern, and silhouette. In this study, garments are categorized into three types: tops, bottoms, and one-piece garments (e.g., dresses). All categories share an identical encoder architecture, while category-specific normalization parameters are applied to preserve morphological characteristics across different garment types.
Given a garment image $x_g$, the garment region is isolated using a garment segmentation mask $m_g$. The masked garment input is defined as
$x_g^{m} = x_g \odot m_g,$
where $\odot$ denotes element-wise multiplication, and $x_g^{m}$ represents the masked garment image with the background removed.
The masked garment input $x_g^{m}$ is then transformed into a latent garment appearance feature through the GarmentNet encoder $E(\cdot)$:
$f_g = E(x_g^{m}).$
The extracted feature $f_g$ serves as a garment appearance representation that captures the local texture details and structural characteristics of the garment, and is subsequently provided to OmniNet as an external conditioning signal.
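The masking and encoding steps above can be sketched as follows; the encoder is a trivial stand-in for the real GarmentNet (a deep network), so only the masking operation itself reflects the paper's formulation:

```python
import numpy as np

def mask_garment(x_g: np.ndarray, m_g: np.ndarray) -> np.ndarray:
    """x_g^m = x_g (element-wise *) m_g: remove the background pixels."""
    return x_g * m_g[..., None]  # broadcast the binary mask over RGB channels

def encode_garment(x_gm: np.ndarray, dim: int = 16) -> np.ndarray:
    """Stand-in for the GarmentNet encoder E(.); purely illustrative,
    pooling per-channel pixel statistics into a fixed-size feature."""
    stats = np.concatenate([x_gm.mean(axis=(0, 1)), x_gm.std(axis=(0, 1))])
    return np.resize(stats, dim)

rng = np.random.default_rng(0)
x_g = rng.random((64, 64, 3))                       # garment image
m_g = np.zeros((64, 64)); m_g[16:48, 16:48] = 1.0   # segmentation mask
f_g = encode_garment(mask_garment(x_g, m_g))        # garment appearance feature
```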

4.3. OmniNet

OmniNet is a diffusion-based U-Net backbone operating in the latent space, which reconstructs human body structure and pose while integrating conditional information extracted from GarmentNet and StyleNet to generate the final image.

4.3.1. Garment Semantic Conditioning via IP-Adapter

To effectively inject the semantic identity of garments into the attention space of the diffusion model, we introduce an IP-Adapter-based conditioning interface within OmniNet. The IP-Adapter takes the original garment image $x_g$ as input and produces key–value representations that can be directly injected into the attention mechanism:
$(K_i, V_i) = \text{IP-Adapter}(x_g).$
The projection layers of the IP-Adapter are configured to be learnable, enabling the model to learn the semantic alignment between garment features and the human latent space.
Textual, style, and garment semantic conditions are jointly considered in the cross-attention stage of OmniNet according to the following equation:
$\text{Attention}(Q, [K_t, K_s, K_i], [V_t, V_s, V_i]) = \text{softmax}\!\left(\frac{Q\,[K_t, K_s, K_i]^{\top}}{\sqrt{d}}\right)[V_t, V_s, V_i].$
Here,
  • $(K_t, V_t)$ denote the key–value pairs derived from the text prompt embeddings.
  • $(K_s, V_s)$ denote the key–value pairs generated from the style embedding $s_i$ produced by StyleNet.
  • $(K_i, V_i)$ denote the garment semantic features generated via the IP-Adapter.
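The concatenated cross-attention described above can be sketched in a few lines of numpy; the dimensions and random inputs are illustrative, and the real model applies this inside the U-Net's multi-head attention layers:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_source_attention(Q, K_t, K_s, K_i, V_t, V_s, V_i):
    """Single-head attention over the concatenation of text (t), style (s),
    and garment-semantic (i) key-value pairs."""
    K = np.concatenate([K_t, K_s, K_i], axis=0)   # (N_t + N_s + N_i, d)
    V = np.concatenate([V_t, V_s, V_i], axis=0)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))             # each row sums to 1
    return A @ V                                  # convex combination of values

rng = np.random.default_rng(0)
d = 8
Q = rng.standard_normal((10, d))                  # latent (query) tokens
K_t, V_t = rng.standard_normal((4, d)), rng.standard_normal((4, d))
K_s, V_s = rng.standard_normal((1, d)), rng.standard_normal((1, d))
K_i, V_i = rng.standard_normal((6, d)), rng.standard_normal((6, d))
out = multi_source_attention(Q, K_t, K_s, K_i, V_t, V_s, V_i)
```

Because the softmax is taken over the concatenated key axis, the three condition sources compete for attention mass within a single distribution rather than being fused in separate passes.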

4.3.2. Human-Garment Appearance Fusion via Self-Attention

Meanwhile, the garment appearance feature $f_g$ extracted from GarmentNet is fused with the human feature $f_p$ within OmniNet through a self-attention mechanism. Specifically, the joint representation of the two features is processed as
$[f_p', f_g'] = \text{SelfAttention}([f_p, f_g]).$
This self-attention operation selectively transfers garment texture information to spatially compatible human regions, thereby preserving the underlying human structure while stably reflecting fine-grained garment details. The resulting features after self-attention are subsequently forwarded to the decoder for final image synthesis.
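A minimal sketch of this fusion step (single-head, with unprojected queries and keys for brevity; the real module uses the U-Net's learned attention projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_person_garment(f_p, f_g):
    """Joint self-attention over concatenated person and garment tokens;
    the updated person tokens absorb spatially compatible garment texture."""
    f = np.concatenate([f_p, f_g], axis=0)        # (N_p + N_g, d)
    A = softmax(f @ f.T / np.sqrt(f.shape[-1]))   # token-to-token weights
    out = A @ f
    return out[: len(f_p)], out[len(f_p):]        # split back into two streams

rng = np.random.default_rng(1)
f_p = rng.standard_normal((12, 8))   # person latent tokens
f_g = rng.standard_normal((5, 8))    # garment appearance tokens
f_p_new, f_g_new = fuse_person_garment(f_p, f_g)
```

The split after attention preserves the spatial layout of the person stream, which is what allows garment texture to be transferred without disturbing the human structure.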

4.3.3. Style Conditioning via Adaptive Layer Normalization

In addition, OmniNet is designed such that style embeddings directly influence the feature distributions through Adaptive Layer Normalization (AdaLN). The AdaLN operation is defined as
$\text{AdaLN}(x) = \gamma(s_i) \cdot \frac{x - \mu(x)}{\sigma(x)} + \beta(s_i),$
where $\gamma$ and $\beta$ denote the scale and shift parameters predicted from the style embedding $s_i$. This formulation enables domain-specific color tone and luminance characteristics to be consistently reflected throughout the diffusion process.
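The AdaLN operation can be sketched as follows; the linear maps `W_gamma` and `W_beta` that predict the scale and shift from the style embedding are an illustrative parameterization, as the paper does not specify the predictor's architecture:

```python
import numpy as np

def adaln(x, s_i, W_gamma, W_beta, eps=1e-5):
    """AdaLN(x) = gamma(s_i) * (x - mu(x)) / sigma(x) + beta(s_i).
    gamma/beta are predicted here by linear maps from the style embedding
    (an assumption for illustration)."""
    gamma = s_i @ W_gamma                       # (d_feat,)
    beta = s_i @ W_beta
    mu = x.mean(axis=-1, keepdims=True)         # per-feature-row statistics
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
d_style, d_feat = 4, 8
x = rng.standard_normal((3, d_feat))            # feature rows
s_i = rng.standard_normal(d_style)              # style embedding
W_gamma = rng.standard_normal((d_style, d_feat))
W_beta = rng.standard_normal((d_style, d_feat))
y = adaln(x, s_i, W_gamma, W_beta)
```

With an identity style mapping (gamma = 1, beta = 0) this reduces to plain layer normalization, so the style embedding only modulates the normalized feature distribution.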
Through this attention-based multi-source conditioning architecture, OmniNet simultaneously achieves the preservation of human structure, the maintenance of garment appearance, and controlled visual style transformation.

4.4. Loss Function

Our model is trained using diverse human data across four visual domains—photorealistic, webtoon, animation, and watercolor—covering a wide range of ages and genders. Each training sample consists of a tuple $(x_p, m_p, x_g, m_g, s_i)$, and the model is optimized by jointly minimizing a diffusion-based noise prediction loss together with style and garment consistency losses. The overall objective function is defined as
$\mathcal{L} = \mathbb{E}_{x_t,\,\epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t;\, c,\, s_i,\, t) \right\|_2^2\right] + \lambda_1 \mathcal{L}_{style} + \lambda_2 \mathcal{L}_{cloth},$
where $\epsilon_\theta(\cdot)$ denotes the noise prediction network parameterized by $\theta$.
The first term corresponds to the standard denoising score matching loss, which encourages OmniNet to learn a stable denoising process under the given conditions. During this process, textual prompts and style tokens are injected via cross-attention, while garment semantic information is integrated into the same attention space through the IP-Adapter. In contrast, garment appearance information is incorporated through the self-attention pathway within OmniNet.
The style consistency loss $\mathcal{L}_{style}$ is a CLIP-based objective that constrains excessive changes in human structural representation when only the style token is altered for the same human–garment input. The garment fidelity loss $\mathcal{L}_{cloth}$ is designed to ensure that the garment appearance features extracted by GarmentNet are faithfully preserved in the generated results.
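The composition of the overall objective can be sketched as follows; the loss weights `lam1`/`lam2` and the scalar stand-ins for the consistency terms are illustrative, since the paper does not report their values:

```python
import numpy as np

def total_loss(eps, eps_pred, l_style, l_cloth, lam1=0.1, lam2=0.1):
    """L = ||eps - eps_theta(x_t; c, s_i, t)||_2^2 + lam1*L_style + lam2*L_cloth.
    lam1/lam2 are illustrative values, not taken from the paper."""
    l_denoise = np.mean((eps - eps_pred) ** 2)   # denoising score matching term
    return l_denoise + lam1 * l_style + lam2 * l_cloth

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 8))                # ground-truth noise
eps_pred = eps + 0.1 * rng.standard_normal((4, 8))  # network prediction
loss = total_loss(eps, eps_pred, l_style=0.5, l_cloth=0.3)
```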
By simultaneously providing domain tokens (style tokens) and garment-related conditions, the model is able to learn domain transfer (style transfer) and garment replacement (Virtual Try-On) within a single unified training framework.

4.5. Training Strategy for Human and Garment Adaptation

Our model is designed to generate consistent results across diverse human types (e.g., gender, age, and body shape) and garment categories (tops, bottoms, and dresses). To this end, we adopt a dual-branch representation architecture that conceptually separates human representations and garment representations, which are subsequently fused within OmniNet through attention-based mechanisms.
Here, the dual-branch structure does not rely on an explicit, separate human encoder. Instead, it refers to the disentangled handling of (i) human structural features that naturally emerge within the diffusion U-Net encoder of OmniNet, and (ii) garment appearance features explicitly extracted by GarmentNet. The intermediate latent features inside OmniNet implicitly encode human structure and pose information during the denoising process, forming a pose-aware representation that is relatively robust to spatial deformations caused by variations in body shape or age.
Garment representations are extracted via GarmentNet, which learns appearance attributes such as silhouette, length, and wrinkle patterns through category-specific module designs. During training, the encoder parameters of GarmentNet are kept fixed (frozen) to prevent distortion of the original garment texture and color information.
These two representations are fused inside OmniNet through self-attention-based spatially aware fusion, enabling garment textures to be applied to spatially consistent human regions. This design allows stable garment synthesis without introducing distortions to the underlying human structure. Meanwhile, garment semantic features injected through the IP-Adapter and style tokens generated by StyleNet provide global semantic cues and domain-specific characteristics via cross-attention and adaptive normalization mechanisms.
During training, only OmniNet and the projection layers of the IP-Adapter are optimized, while GarmentNet and StyleNet remain in their pre-trained states. With this training strategy, our model can robustly handle diverse combinations of human attributes, garments, and styles within a single unified network, simultaneously achieving human structure preservation, garment appearance fidelity, and controllable style transformation.
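This selective-training scheme can be sketched in PyTorch; the `nn.Linear` stand-ins are placeholders for the actual modules, and the optimizer choice and learning rate are assumptions:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the four components (the real modules are large networks).
omninet = nn.Linear(8, 8)
ip_adapter_proj = nn.Linear(8, 8)   # IP-Adapter projection layers
garmentnet = nn.Linear(8, 8)
stylenet = nn.Linear(8, 8)

# Freeze the pre-trained encoders; only OmniNet and the IP-Adapter
# projection layers receive gradient updates.
for frozen in (garmentnet, stylenet):
    for p in frozen.parameters():
        p.requires_grad_(False)

trainable = [p for m in (omninet, ip_adapter_proj) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Freezing GarmentNet and StyleNet keeps the garment texture and style representations stable, so gradient updates cannot drift the pre-trained appearance and style spaces during VTON training.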

5. Implementation and Results

5.1. Implementation Detail

We implemented our model on a cloud platform equipped with an Intel Xeon Platinum 8480 CPU and an NVIDIA H100 GPU with 80 GB of memory. Among the modules of our model, StyleNet is implemented with a CLIP encoder, since a CLIP embedding loss is employed to guide image transformations. Our training process focuses on OmniNet and the IP-Adapter. OmniNet, which applies the clothing information to the character in a scene, is trained using segmentation maps that correspond to the clothes of the target character. We supply 253 characters with their clothing maps for the training of OmniNet and the IP-Adapter. This training process takes 1.5 h for each character category.

5.2. Results

Our framework targets four styles: photorealistic, animation, webtoon, and watercolor. We sample four characters: a man, a woman, an old person, and a child. For each character, we apply six different sets of clothes. The results are presented in Figure 2, Figure 3, Figure 4 and Figure 5. In the upper three rows of each figure, outfits composed of an upper garment and a lower garment are applied to the characters, with the same outfits applied to all four characters. In the lower three rows, we apply different outfits to male and female characters: the man and the old person are male characters, while the woman and the child are female characters. The characters exhibit various poses, demonstrating that our framework applies to characters in diverse poses.

6. Evaluation

6.1. Comparison

We select four existing studies for comparison with our results: CatVTON [6], StableVITON [4], OOTDiffusion [11], and VITON-HD [19]. Among these, VITON-HD is applied only to upper garments. We run the comparison over four characters and six outfits, producing twenty-four results; the same twenty-four results are also generated by each of the four compared studies. We choose four human figures (man, woman, old person, and child) as the most typical types of human figures for our comparison. Sampled results are presented in Figure 6, Figure 7, Figure 8 and Figure 9.

6.2. Quantitative Evaluation

For the evaluation of our results, we estimate the following four metrics for our method and the four compared studies: CLIP score for person, CLIP score for cloth, Fréchet Inception Distance (FID), and Kernel Inception Distance (KID). The CLIP score measures how well a generated image aligns with the provided textual conditions. In our study, the quality of the generated images is evaluated by computing CLIP scores using captions assigned to the generated images. In particular, we separately measure the CLIP score for characters (CLIP score for person) and for clothing (CLIP score for cloth), enabling an independent assessment of how accurately the characters are generated and how precisely the clothing is rendered.
To complement the CLIP score, we additionally estimate FID and KID. FID evaluates the visual quality and diversity of the generated images, while KID complements FID by measuring distributional differences with a maximum mean discrepancy (MMD)-based kernel, without assuming a Gaussian distribution.
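For reference, the standard KID estimator (unbiased MMD² with the cubic polynomial kernel over Inception features) can be sketched as follows; the random feature matrices stand in for real Inception activations:

```python
import numpy as np

def poly_kernel(X, Y, degree=3):
    """Cubic polynomial kernel k(x, y) = (x.y/d + 1)^3, as used by KID."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** degree

def kid(feat_real, feat_fake):
    """Unbiased MMD^2 estimate; unlike FID, no Gaussian assumption is made."""
    Kxx = poly_kernel(feat_real, feat_real)
    Kyy = poly_kernel(feat_fake, feat_fake)
    Kxy = poly_kernel(feat_real, feat_fake)
    m, n = len(feat_real), len(feat_fake)
    sum_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # drop diagonal terms
    sum_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
feat_real = rng.standard_normal((100, 16))
feat_fake_close = rng.standard_normal((100, 16))        # same distribution
feat_fake_far = rng.standard_normal((100, 16)) + 3.0    # shifted distribution
```

Features drawn from the same distribution yield a KID near zero, while a shifted distribution produces a clearly larger value, which is what makes KID informative for small evaluation sets.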
The results of the estimation are presented in Table 1 and Table 2. Table 1 reports the metrics for images in the photorealistic and animation styles, and Table 2 reports those for the webtoon and watercolor styles. Our results record 15 best scores among the 16 metrics for the photorealistic style, 12 of 16 for the animation style, 9 of 16 for the webtoon style, and 13 of 16 for the watercolor style. From these tables, we confirm that our model outperforms the four compared models.
Among the four evaluated styles, the photorealistic style achieves the highest overall performance, while the webtoon style exhibits a relatively smaller number of metrics in which our method outperforms the compared approaches (9 out of 16). This behavior can be attributed to the inherent characteristics of the webtoon style, which tends to simplify complex visual patterns for stylized depiction. As a result, the strengths of our model—particularly its ability to faithfully preserve fine-grained and intricate garment patterns—are less prominently reflected under the webtoon-style setting.

6.3. Cross-Dataset Generalization Discussion

The quantitative evaluation of our study is primarily conducted on the DressCode dataset, which contains diverse garment categories, including tops, bottoms, and dresses, and is well suited for analyzing fine-grained garment control and style adaptation performance. Nevertheless, evaluations based on a single dataset may be insufficient to fully demonstrate the generalization capability of our model.
For instance, other Virtual Try-On datasets such as VITON-HD exhibit distinct characteristics in terms of pose distributions, garment appearances, and image resolutions. These differences are particularly important when considering real-world deployment scenarios.
Our method adopts a disentangled processing strategy that separately handles human structure, garment appearance, and style information. Owing to this design, our model is expected to exhibit relatively stable behavior even under distribution shifts across datasets. In future work, we plan to further validate this property through additional experiments, including cross-dataset evaluations, to more clearly assess the generalization performance of our approach.

6.4. Ablation Study

To more accurately evaluate the effectiveness of our approach, we conduct an ablation study on two key components of our model: the IP-Adapter and GarmentNet. As illustrated in Figure 10a, the IP-Adapter plays a crucial role in preserving the accurate shape, color, and pattern of the target garment. When the IP-Adapter is removed, the generated results suffer from noticeable mismatches in garment shape and color compared to the input garment.
Furthermore, as shown in Figure 10b, GarmentNet is responsible for delivering garment appearance features to the diffusion backbone. In the absence of GarmentNet, the desired garment specified by the user is not properly reflected in the generated output, and severe visual artifacts are observed.
For the results presented in Figure 10, we additionally perform a quantitative evaluation using the metrics adopted in the previous section. The measured CLIP score, FID, and KID are presented in Table 3, which quantitatively supports the qualitative observations from the ablation analysis.

7. Conclusions and Future Work

In this study, we proposed a generative Virtual Try-On framework capable of consistently handling the diverse characters, garments, and visual styles required in story visualization scenarios. Our approach is built upon an integrated architecture composed of OmniNet, which captures human structure and pose; GarmentNet, which explicitly extracts garment appearance features; and StyleNet, which controls domain-specific visual styles. By modeling human, garment, and style information as functionally disentangled conditioning signals and integrating them within the self-attention and cross-attention mechanisms of a diffusion model, the proposed framework simultaneously achieves human structure preservation, garment appearance fidelity, and controllable style transformation. Extensive experiments across multiple style domains and garment categories demonstrate that our method delivers superior quantitative performance and stable visual quality compared to existing Virtual Try-On approaches.
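The disentangled conditioning described above can be illustrated with a toy cross-attention step in which image latents attend over concatenated human, garment, and style tokens. This is an illustrative sketch, not the authors' implementation: all dimensions, token counts, and projection matrices below are arbitrary stand-ins.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, cond_tokens, wq, wk, wv):
    """Single-head cross-attention: image latents attend over condition tokens."""
    q = latents @ wq          # (n_latent, d_head) queries from image latents
    k = cond_tokens @ wk      # (n_cond, d_head)  keys from conditioning tokens
    v = cond_tokens @ wv      # (n_cond, d_head)  values from conditioning tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v           # (n_latent, d_head) condition-injected features

rng = np.random.default_rng(0)
d_model, d_head = 64, 32
latents = rng.standard_normal((16, d_model))   # image latents (e.g., 4x4 patches)
human = rng.standard_normal((4, d_model))      # stand-in human/pose tokens
garment = rng.standard_normal((6, d_model))    # stand-in garment appearance tokens
style = rng.standard_normal((2, d_model))      # stand-in style-prompt tokens
cond = np.concatenate([human, garment, style]) # disentangled signals, one sequence
wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = cross_attention(latents, cond, wq, wk, wv)
print(out.shape)  # (16, 32)
```

Because the three conditioning sources remain separate token groups until the attention step, each can be ablated or swapped independently, which is the property the ablation study in Section 6.4 probes.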
In future work, we plan to further investigate the generalization capability of the proposed framework through cross-dataset evaluations and additional experiments using more complex and realistic datasets. We also aim to extend our model to better handle special-purpose garments and multi-layered clothing structures commonly encountered in games, fantasy films, and immersive virtual environments. Furthermore, by enhancing continuous style control and incorporating user-interactive conditioning mechanisms, we seek to evolve our framework into a more versatile Virtual Try-On solution applicable to a wide range of generative content creation pipelines beyond story visualization.

Author Contributions

Conceptualization, K.M.; Methodology, W.C.; Software, W.C.; Validation, H.Y.; Resources, W.C.; Data curation, W.C.; Writing—original draft, K.M.; Writing—review & editing, H.Y.; Supervision, H.Y. and K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because they are part of an ongoing study. Requests to access the datasets should be directed to yanghk@smu.ac.kr. The copyright of the photographs used in our manuscript falls under “Fair Use”.

Acknowledgments

This research was supported by Sangmyung University in 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NeurIPS 2020, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  2. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the ICLR 2021, Vienna, Austria, 4 May 2021; pp. 1–25. [Google Scholar]
  3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  4. Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 8176–8185. [Google Scholar]
  5. Choi, Y.; Park, S.; Lee, S.; Kwak, J.; Choo, J. IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-On in the Wild. In Proceedings of the ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 206–235. [Google Scholar]
  6. Chong, Z.; Dong, X.; Li, H.; Zhang, S.; Zhang, W.; Zhang, X.; Zhao, H.; Jiang, D.; Liang, X. CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. In Proceedings of the ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar]
  7. Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM Multimedia 2023, Vancouver, BC, Canada, 7–10 June 2023; pp. 8580–8589. [Google Scholar]
  8. Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 4606–4615. [Google Scholar]
  9. Gou, J.; Sun, S.; Zhang, J.; Si, J.; Qian, C.; Zhang, L. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the ACM Multimedia 2023, Vancouver, BC, Canada, 7–10 June 2023; pp. 7599–7607. [Google Scholar]
  10. Ning, S.; Wang, D.; Qin, Y.; Jin, Z.; Wang, B.; Han, X. Picture: Photorealistic Virtual Try-On from Unconstrained Designs. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 6976–6985. [Google Scholar]
  11. Xu, Y.; Gu, T.; Chen, W.; Chen, C. OOTDiffusion: Outfitting Fusion-based Latent Diffusion for Controllable Virtual Try-On. In Proceedings of the AAAI 2025, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8996–9004. [Google Scholar]
  12. Han, X.; Wu, Z.; Wu, Z.; Yu, R.; Davis, L. VITON: An Image-based Virtual Try-on Network. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7543–7552. [Google Scholar]
  13. Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; Yang, M. Toward Characteristic-Preserving Image-based Virtual Try-On Network. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 589–604. [Google Scholar]
  14. Ge, C.; Song, Y.; Ge, Y.; Yang, H.; Liu, W.; Luo, P. Disentangled Cycle Consistency for Highly-Realistic Virtual Try-On. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 16928–16937. [Google Scholar]
  15. Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; Luo, P. Parser-Free Virtual Try-On via Distilling Appearance Flows. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 8485–8493. [Google Scholar]
  16. Issenhuth, T.; Mary, J.; Calauzenes, C. Do Not Mask What You Do Not Need to Mask: A Parser-Free Virtual Try-On. In Proceedings of the ECCV 2020, Glasgow, Scotland, 23 August 2020; pp. 619–635. [Google Scholar]
  17. Cui, A.; Mahajan, J.; Shah, V.; Gomathinayagam, P.; Lazebnik, S. Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 8235–8239. [Google Scholar]
  18. Xie, Z.; Huang, Z.; Dong, X.; Zhao, F.; Dong, H.; Zhang, X.; Zhu, F.; Liang, X. GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 23550–23559. [Google Scholar]
  19. Choi, S.; Park, S.; Lee, M.; Choo, J. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 14131–14140. [Google Scholar]
  20. Lee, S.; Gu, G.; Park, S.; Choi, S.; Choo, J. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 204–219. [Google Scholar]
  21. Morelli, D.; Fincato, M.; Cornia, M.; Landi, F.; Cesari, F.; Cucchiara, R. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 2231–2235. [Google Scholar]
  22. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 1–12. [Google Scholar]
  23. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the ICCV 2023, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  24. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y.; Qie, X. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 1–10. [Google Scholar]
  25. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text-Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. In Proceedings of the NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; pp. 1–12. [Google Scholar]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Figure 1. The overview of our framework: The input to our system is composed of a style prompt, which is processed in StyleNet; a garment image ( x g ) and its garment mask ( m g ), which are processed in GarmentNet; and a person image ( x p ) and its mask map ( m p ), which are processed in OmniNet. In OmniNet, the self-attention module processes low-level features such as texture pattern and the cross-attention module processes high-level features such as body, garment, and style.
Figure 2. Result images (1): photorealistic style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 3. Result images (2): Animation style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 4. Result images (3): webtoon style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 5. Result images (4): Watercolor style. The columns correspond to man, woman, old man, old woman, boy, and girl, respectively.
Figure 6. Comparison (1): photorealistic style: man and child are compared.
Figure 7. Comparison (2): Animation style: woman and old person are compared.
Figure 8. Comparison (3): Webtoon style: man and old person are compared.
Figure 9. Comparison (4): Watercolor style: woman and child are compared.
Figure 10. Ablation study for visual contents.
Table 1. Comparison on photorealistic style and animation style. ↑ denotes that higher values are better and ↓ denotes that lower values are better. Bold text indicates the best results.
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| photorealistic | male | CLIP (for person) ↑ | 0.948 | 0.945 | 0.866 | 0.795 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.546 | 0.479 | 0.510 | 0.526 | 0.483 |
| | | FID ↓ | 113.530 | 136.284 | 221.039 | 218.031 | 132.419 |
| | | KID ↓ | 0.019 | 0.023 | 0.020 | 0.036 | 0.037 |
| | female | CLIP (for person) ↑ | 0.905 | 0.903 | 0.827 | 0.821 | 0.895 |
| | | CLIP (for cloth) ↓ | 0.646 | 0.506 | 0.555 | 0.611 | 0.523 |
| | | FID ↓ | 159.585 | 167.863 | 253.367 | 242.290 | 171.420 |
| | | KID ↓ | 0.010 | 0.021 | 0.018 | 0.021 | 0.041 |
| | old person | CLIP (for person) ↑ | 0.927 | 0.943 | 0.818 | 0.812 | 0.907 |
| | | CLIP (for cloth) ↓ | 0.555 | 0.447 | 0.500 | 0.545 | 0.463 |
| | | FID ↓ | 236.439 | 114.369 | 265.059 | 242.273 | 158.976 |
| | | KID ↓ | 0.009 | 0.091 | 0.013 | 0.026 | 0.073 |
| | child | CLIP (for person) ↑ | 0.930 | 0.924 | 0.867 | 0.832 | 0.892 |
| | | CLIP (for cloth) ↓ | 0.593 | 0.496 | 0.536 | 0.565 | 0.524 |
| | | FID ↓ | 162.779 | 171.584 | 218.992 | 223.761 | 174.003 |
| | | KID ↓ | 0.021 | 0.025 | 0.061 | 0.058 | 0.022 |
| animation | male | CLIP (for person) ↑ | 0.953 | 0.947 | 0.825 | 0.872 | 0.948 |
| | | CLIP (for cloth) ↓ | 0.579 | 0.478 | 0.541 | 0.543 | 0.490 |
| | | FID ↓ | 103.875 | 154.046 | 193.646 | 206.615 | 176.143 |
| | | KID ↓ | 0.005 | 0.038 | 0.011 | 0.014 | 0.050 |
| | female | CLIP (for person) ↑ | 0.942 | 0.930 | 0.861 | 0.841 | 0.911 |
| | | CLIP (for cloth) ↓ | 0.557 | 0.468 | 0.511 | 0.542 | 0.484 |
| | | FID ↓ | 155.628 | 157.243 | 201.373 | 206.168 | 201.903 |
| | | KID ↓ | 0.013 | 0.022 | 0.014 | 0.017 | 0.007 |
| | old person | CLIP (for person) ↑ | 0.940 | 0.949 | 0.799 | 0.816 | 0.941 |
| | | CLIP (for cloth) ↓ | 0.577 | 0.439 | 0.517 | 0.526 | 0.472 |
| | | FID ↓ | 100.282 | 103.348 | 234.856 | 239.627 | 143.891 |
| | | KID ↓ | 0.005 | 0.062 | 0.001 | 0.011 | 0.058 |
| | child | CLIP (for person) ↑ | 0.949 | 0.948 | 0.869 | 0.894 | 0.943 |
| | | CLIP (for cloth) ↓ | 0.567 | 0.458 | 0.520 | 0.490 | 0.471 |
| | | FID ↓ | 186.895 | 104.953 | 271.145 | 197.011 | 156.727 |
| | | KID ↓ | 0.014 | 0.044 | 0.038 | 0.016 | 0.036 |
Table 2. Comparison on webtoon style and watercolor styles. ↑ denotes that higher values are better and ↓ denotes that lower values are better. Bold text indicates the best results.
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| webtoon | male | CLIP (for person) ↑ | 0.938 | 0.914 | 0.854 | 0.820 | 0.940 |
| | | CLIP (for cloth) ↓ | 0.541 | 0.481 | 0.518 | 0.543 | 0.486 |
| | | FID ↓ | 109.101 | 175.445 | 180.319 | 227.459 | 159.785 |
| | | KID ↓ | 0.025 | 0.026 | 0.003 | 0.032 | 0.022 |
| | female | CLIP (for person) ↑ | 0.947 | 0.926 | 0.919 | 0.844 | 0.923 |
| | | CLIP (for cloth) ↓ | 0.560 | 0.474 | 0.507 | 0.552 | 0.492 |
| | | FID ↓ | 154.373 | 168.693 | 175.951 | 205.340 | 189.056 |
| | | KID ↓ | 0.005 | 0.008 | 0.007 | 0.011 | 0.020 |
| | old person | CLIP (for person) ↑ | 0.936 | 0.945 | 0.961 | 0.817 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.551 | 0.451 | 0.517 | 0.520 | 0.485 |
| | | FID ↓ | 136.612 | 144.740 | 204.457 | 190.029 | 150.324 |
| | | KID ↓ | 0.028 | 0.046 | 0.005 | 0.014 | 0.045 |
| | child | CLIP (for person) ↑ | 0.942 | 0.949 | 0.836 | 0.883 | 0.942 |
| | | CLIP (for cloth) ↓ | 0.582 | 0.453 | 0.517 | 0.491 | 0.466 |
| | | FID ↓ | 111.999 | 119.561 | 216.960 | 213.377 | 159.754 |
| | | KID ↓ | 0.024 | 0.032 | 0.017 | 0.007 | 0.014 |
| watercolor | male | CLIP (for person) ↑ | 0.933 | 0.899 | 0.900 | 0.829 | 0.936 |
| | | CLIP (for cloth) ↓ | 0.624 | 0.504 | 0.522 | 0.580 | 0.510 |
| | | FID ↓ | 109.170 | 165.194 | 183.500 | 237.846 | 115.074 |
| | | KID ↓ | 0.015 | 0.026 | 0.002 | 0.007 | 0.064 |
| | female | CLIP (for person) ↑ | 0.927 | 0.888 | 0.900 | 0.857 | 0.917 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.489 | 0.525 | 0.552 | 0.520 |
| | | FID ↓ | 165.093 | 188.962 | 192.864 | 222.935 | 179.176 |
| | | KID ↓ | 0.001 | 0.035 | 0.009 | 0.008 | 0.014 |
| | old person | CLIP (for person) ↑ | 0.912 | 0.933 | 0.821 | 0.843 | 0.927 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.406 | 0.493 | 0.503 | 0.446 |
| | | FID ↓ | 208.356 | 144.734 | 251.704 | 217.168 | 144.041 |
| | | KID ↓ | 0.005 | 0.068 | 0.009 | 0.028 | 0.059 |
| | child | CLIP (for person) ↑ | 0.951 | 0.941 | 0.887 | 0.848 | 0.933 |
| | | CLIP (for cloth) ↓ | 0.570 | 0.497 | 0.541 | 0.558 | 0.505 |
| | | FID ↓ | 116.212 | 148.449 | 181.683 | 209.659 | 120.766 |
| | | KID ↓ | 0.001 | 0.029 | 0.027 | 0.050 | 0.003 |
Table 3. Quantitative results of the ablation study. ↑ denotes that higher values are better and ↓ denotes that lower values are better.
| Component | Garment | Metric | With | Without |
|---|---|---|---|---|
| (a) IP-Adapter | upper | CLIP (for cloth) ↑ | 0.924 | 0.432 |
| | | FID ↓ | 100.454 | 120.545 |
| | | KID ↓ | 0.024 | 0.035 |
| | lower | CLIP (for cloth) ↑ | 0.952 | 0.632 |
| | | FID ↓ | 164.023 | 188.152 |
| | | KID ↓ | 0.002 | 0.012 |
| (b) GarmentNet | upper | CLIP (for cloth) ↑ | 0.911 | 0.824 |
| | | FID ↓ | 135.312 | 144.251 |
| | | KID ↓ | 0.025 | 0.046 |
| | lower | CLIP (for cloth) ↑ | 0.852 | 0.615 |
| | | FID ↓ | 140.125 | 200.254 |
| | | KID ↓ | 0.012 | 0.023 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Choi, W.; Yang, H.; Min, K. A Style-Adapted Virtual Try-On Technique for Story Visualization. Electronics 2026, 15, 514. https://doi.org/10.3390/electronics15030514