A Style-Adapted Virtual Try-On Technique for Story Visualization
Abstract
1. Introduction
2. Related Work
2.1. Early VTON Techniques for Garment Alignment
2.2. Diffusion-Based VTON Techniques
2.2.1. Early Diffusion-Based VTON Techniques
2.2.2. End-to-End Structure Diffusion-Based VTON Techniques
3. Outline
- StyleNet encodes style characteristics from various visual domains—including photorealistic, animation, webtoon, and watercolor—and injects them into the attention layers of the diffusion network.
- GarmentNet extracts fine-grained visual features such as garment texture, shape, and color.
- OmniNet captures the human structure and pose and restores the masked regions.
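Since only this outline is available here, the following PyTorch sketch illustrates one way the three modules could plug together. Every interface, shape, and layer choice below is a hypothetical stand-in for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StyleNet(nn.Module):
    """Hypothetical stub: encodes a style reference image into a single
    style embedding to be injected into the denoiser's attention layers."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, style_img: torch.Tensor) -> torch.Tensor:
        return self.encode(style_img)  # (B, embed_dim)

class GarmentNet(nn.Module):
    """Hypothetical stub: turns a garment image into a token sequence
    carrying fine-grained texture, shape, and color features."""
    def __init__(self, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, garment_img: torch.Tensor) -> torch.Tensor:
        return self.to_tokens(garment_img).flatten(2).transpose(1, 2)  # (B, L, D)

class OmniNet(nn.Module):
    """Hypothetical stub for the denoiser: consumes the noisy latent plus
    the masked-person latent (with pose cues) and predicts the noise; a
    real UNet would inject style_emb and garment_tokens via attention."""
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.net = nn.Conv2d(2 * latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, masked_person_latent, style_emb, garment_tokens):
        x = torch.cat([noisy_latent, masked_person_latent], dim=1)
        return self.net(x)

# Shapes are illustrative only.
style = StyleNet()(torch.randn(1, 3, 224, 224))
tokens = GarmentNet()(torch.randn(1, 3, 256, 192))
noise = OmniNet()(torch.randn(1, 4, 32, 24), torch.randn(1, 4, 32, 24), style, tokens)
```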
4. Our Method
4.1. StyleNet
4.2. GarmentNet
4.3. OmniNet
4.3.1. Garment Semantic Conditioning via IP-Adapter
- $K_t$ and $V_t$ denote the key–value pairs derived from the text prompt embeddings.
- $K_s$ and $V_s$ denote the key–value pairs generated from the style embedding produced by StyleNet.
- $K_g$ and $V_g$ denote the key–value pairs projected from the garment semantic features generated via the IP-Adapter.
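These symbols follow the decoupled cross-attention design of IP-Adapter, where multiple key–value streams attend to the same queries. A minimal sketch is shown below; the additive combination of the three streams and the scale factors are assumptions for illustration rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, kv_text, kv_style, kv_garment,
                              style_scale: float = 1.0,
                              garment_scale: float = 1.0):
    """Run three attention streams over the same queries and sum them,
    in the spirit of IP-Adapter's decoupled cross-attention.
    kv_text    = (K_t, V_t): projected text prompt embeddings
    kv_style   = (K_s, V_s): projected StyleNet style embedding
    kv_garment = (K_g, V_g): projected IP-Adapter garment features"""
    k_t, v_t = kv_text
    k_s, v_s = kv_style
    k_g, v_g = kv_garment
    out = F.scaled_dot_product_attention(q, k_t, v_t)
    out = out + style_scale * F.scaled_dot_product_attention(q, k_s, v_s)
    out = out + garment_scale * F.scaled_dot_product_attention(q, k_g, v_g)
    return out

# Illustrative shapes: (batch, heads, tokens, head_dim).
q = torch.randn(2, 8, 1024, 64)
kv = lambda n: (torch.randn(2, 8, n, 64), torch.randn(2, 8, n, 64))
fused = decoupled_cross_attention(q, kv(77), kv(1), kv(16))  # (2, 8, 1024, 64)
```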
4.3.2. Human-Garment Appearance Fusion via Self-Attention
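The section body is not reproduced in this outline, so the sketch below shows the common way such a fusion is realized: person and garment tokens are concatenated along the sequence axis and processed by shared self-attention (cf. CatVTON's concatenation scheme). Treating this as the paper's exact mechanism would be an assumption.

```python
import torch
import torch.nn as nn

def fuse_person_garment(person_tokens: torch.Tensor,
                        garment_tokens: torch.Tensor,
                        attn: nn.MultiheadAttention) -> torch.Tensor:
    """Joint self-attention over concatenated person and garment tokens,
    letting garment appearance flow into the masked person region;
    only the person part of the sequence is returned."""
    x = torch.cat([person_tokens, garment_tokens], dim=1)  # (B, Lp + Lg, C)
    fused, _ = attn(x, x, x, need_weights=False)
    return fused[:, : person_tokens.shape[1]]              # (B, Lp, C)

attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
person = fuse_person_garment(torch.randn(1, 1024, 768),
                             torch.randn(1, 192, 768), attn)
```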
4.3.3. Style Conditioning via Adaptive Layer Normalization
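Again only the heading is available; the block below sketches the standard adaptive layer normalization used for style conditioning, in which a style embedding predicts the per-channel scale and shift applied after a parameter-free LayerNorm (as popularized by DiT-style models). The paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """Adaptive LayerNorm: the style embedding modulates normalized
    activations with a predicted per-channel scale and shift."""
    def __init__(self, dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * dim)

    def forward(self, x: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim), style_emb: (B, style_dim)
        scale, shift = self.to_scale_shift(style_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

adaln = AdaLayerNorm(dim=768, style_dim=768)
y = adaln(torch.randn(1, 1024, 768), torch.randn(1, 768))  # (1, 1024, 768)
```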
4.4. Loss Function
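The loss itself is not reproduced in this outline. For orientation, the standard latent-diffusion denoising objective that try-on diffusion models typically minimize is shown below, with $c$ bundling the text, style, and garment conditions; any additional garment- or style-specific terms the paper may add are not assumed here.

```latex
\mathcal{L}_{\mathrm{diff}}
  = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c\right) \right\|_2^2 \right]
```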
4.5. Training Strategy for Human and Garment Adaptation
5. Implementation and Results
5.1. Implementation Details
5.2. Results
6. Evaluation
6.1. Comparison
6.2. Quantitative Evaluation
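The tables below report CLIP similarity, FID, and KID. As a reference point for the CLIP scores, the following sketch computes the cosine similarity between CLIP image embeddings of a generated crop and its target (person or garment) with the Hugging Face transformers API; the paper's exact cropping and aggregation protocol is not specified here, so this setup is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```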
6.3. Cross-Dataset Generalization Discussion
6.4. Ablation Study
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NeurIPS 2020, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the ICLR 2021, Virtual, 3–7 May 2021; pp. 1–25. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 8176–8185. [Google Scholar]
- Choi, Y.; Park, S.; Lee, S.; Kwak, J.; Choo, J. IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-On in the Wild. In Proceedings of the ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 206–235. [Google Scholar]
- Chong, Z.; Dong, X.; Li, H.; Zhang, S.; Zhang, W.; Zhang, X.; Zhao, H.; Jiang, D.; Liang, X. CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. In Proceedings of the ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar]
- Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM Multimedia 2023, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8580–8589. [Google Scholar]
- Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 4606–4615. [Google Scholar]
- Gou, J.; Sun, S.; Zhang, J.; Si, J.; Qian, C.; Zhang, L. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the ACM Multimedia 2023, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7599–7607. [Google Scholar]
- Ning, S.; Wang, D.; Qin, Y.; Jin, Z.; Wang, B.; Han, X. Picture: Photorealistic Virtual Try-On from Unconstrained Designs. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 6976–6985. [Google Scholar]
- Xu, Y.; Gu, T.; Chen, W.; Chen, C. OOTDiffusion: Outfitting Fusion-based Latent Diffusion for Controllable Virtual Try-On. In Proceedings of the AAAI 2025, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8996–9004. [Google Scholar]
- Han, X.; Wu, Z.; Wu, Z.; Yu, R.; Davis, L. VITON: An Image-based Virtual Try-on Network. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7543–7552. [Google Scholar]
- Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; Yang, M. Toward Characteristic-Preserving Image-based Virtual Try-On Network. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 589–604. [Google Scholar]
- Ge, C.; Song, Y.; Ge, Y.; Yang, H.; Liu, W.; Luo, P. Disentangled Cycle Consistency for Highly-Realistic Virtual Try-On. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 16928–16937. [Google Scholar]
- Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; Luo, P. Parser-Free Virtual Try-On via Distilling Appearance Flows. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 8485–8493. [Google Scholar]
- Issenhuth, T.; Mary, J.; Calauzenes, C. Do Not Mask What You Do Not Need to Mask: A Parser-Free Virtual Try-On. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 619–635. [Google Scholar]
- Cui, A.; Mahajan, J.; Shah, V.; Gomathinayagam, P.; Lazebnik, S. Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images. In Proceedings of the CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 8235–8239. [Google Scholar]
- Xie, Z.; Huang, Z.; Dong, X.; Zhao, F.; Dong, H.; Zhang, X.; Zhu, F.; Liang, X. GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 23550–23559. [Google Scholar]
- Choi, S.; Park, S.; Lee, M.; Choo, J. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 14131–14140. [Google Scholar]
- Lee, S.; Gu, G.; Park, S.; Choi, S.; Choo, J. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 204–219. [Google Scholar]
- Morelli, D.; Fincato, M.; Cornia, M.; Landi, F.; Cesari, F.; Cucchiara, R. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 2231–2235. [Google Scholar]
- Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 1–12. [Google Scholar]
- Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the ICCV 2023, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
- Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y.; Qie, X. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 1–10. [Google Scholar]
- Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text-Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. In Proceedings of the NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; pp. 1–12. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| photorealistic | male | CLIP (for person) ↑ | 0.948 | 0.945 | 0.866 | 0.795 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.546 | 0.479 | 0.510 | 0.526 | 0.483 |
| | | FID ↓ | 113.530 | 136.284 | 221.039 | 218.031 | 132.419 |
| | | KID ↓ | 0.019 | 0.023 | 0.020 | 0.036 | 0.037 |
| | female | CLIP (for person) ↑ | 0.905 | 0.903 | 0.827 | 0.821 | 0.895 |
| | | CLIP (for cloth) ↓ | 0.646 | 0.506 | 0.555 | 0.611 | 0.523 |
| | | FID ↓ | 159.585 | 167.863 | 253.367 | 242.290 | 171.420 |
| | | KID ↓ | 0.010 | 0.021 | 0.018 | 0.021 | 0.041 |
| | old person | CLIP (for person) ↑ | 0.927 | 0.943 | 0.818 | 0.812 | 0.907 |
| | | CLIP (for cloth) ↓ | 0.555 | 0.447 | 0.500 | 0.545 | 0.463 |
| | | FID ↓ | 236.439 | 114.369 | 265.059 | 242.273 | 158.976 |
| | | KID ↓ | 0.009 | 0.091 | 0.013 | 0.026 | 0.073 |
| | child | CLIP (for person) ↑ | 0.930 | 0.924 | 0.867 | 0.832 | 0.892 |
| | | CLIP (for cloth) ↓ | 0.593 | 0.496 | 0.536 | 0.565 | 0.524 |
| | | FID ↓ | 162.779 | 171.584 | 218.992 | 223.761 | 174.003 |
| | | KID ↓ | 0.021 | 0.025 | 0.061 | 0.058 | 0.022 |
| animation | male | CLIP (for person) ↑ | 0.953 | 0.947 | 0.825 | 0.872 | 0.948 |
| | | CLIP (for cloth) ↓ | 0.579 | 0.478 | 0.541 | 0.543 | 0.490 |
| | | FID ↓ | 103.875 | 154.046 | 193.646 | 206.615 | 176.143 |
| | | KID ↓ | 0.005 | 0.038 | 0.011 | 0.014 | 0.050 |
| | female | CLIP (for person) ↑ | 0.942 | 0.930 | 0.861 | 0.841 | 0.911 |
| | | CLIP (for cloth) ↓ | 0.557 | 0.468 | 0.511 | 0.542 | 0.484 |
| | | FID ↓ | 155.628 | 157.243 | 201.373 | 206.168 | 201.903 |
| | | KID ↓ | 0.013 | 0.022 | 0.014 | 0.017 | 0.007 |
| | old person | CLIP (for person) ↑ | 0.940 | 0.949 | 0.799 | 0.816 | 0.941 |
| | | CLIP (for cloth) ↓ | 0.577 | 0.439 | 0.517 | 0.526 | 0.472 |
| | | FID ↓ | 100.282 | 103.348 | 234.856 | 239.627 | 143.891 |
| | | KID ↓ | 0.005 | 0.062 | 0.001 | 0.011 | 0.058 |
| | child | CLIP (for person) ↑ | 0.949 | 0.948 | 0.869 | 0.894 | 0.943 |
| | | CLIP (for cloth) ↓ | 0.567 | 0.458 | 0.520 | 0.490 | 0.471 |
| | | FID ↓ | 186.895 | 104.953 | 271.145 | 197.011 | 156.727 |
| | | KID ↓ | 0.014 | 0.044 | 0.038 | 0.016 | 0.036 |
| Style | Target | Metric | Ours | CatVTON | StableVITON | OOTDiffusion | VITON-HD |
|---|---|---|---|---|---|---|---|
| webtoon | male | CLIP (for person) ↑ | 0.938 | 0.914 | 0.854 | 0.820 | 0.940 |
| | | CLIP (for cloth) ↓ | 0.541 | 0.481 | 0.518 | 0.543 | 0.486 |
| | | FID ↓ | 109.101 | 175.445 | 180.319 | 227.459 | 159.785 |
| | | KID ↓ | 0.025 | 0.026 | 0.003 | 0.032 | 0.022 |
| | female | CLIP (for person) ↑ | 0.947 | 0.926 | 0.919 | 0.844 | 0.923 |
| | | CLIP (for cloth) ↓ | 0.560 | 0.474 | 0.507 | 0.552 | 0.492 |
| | | FID ↓ | 154.373 | 168.693 | 175.951 | 205.340 | 189.056 |
| | | KID ↓ | 0.005 | 0.008 | 0.007 | 0.011 | 0.020 |
| | old person | CLIP (for person) ↑ | 0.936 | 0.945 | 0.961 | 0.817 | 0.925 |
| | | CLIP (for cloth) ↓ | 0.551 | 0.451 | 0.517 | 0.520 | 0.485 |
| | | FID ↓ | 136.612 | 144.740 | 204.457 | 190.029 | 150.324 |
| | | KID ↓ | 0.028 | 0.046 | 0.005 | 0.014 | 0.045 |
| | child | CLIP (for person) ↑ | 0.942 | 0.949 | 0.836 | 0.883 | 0.942 |
| | | CLIP (for cloth) ↓ | 0.582 | 0.453 | 0.517 | 0.491 | 0.466 |
| | | FID ↓ | 111.999 | 119.561 | 216.960 | 213.377 | 159.754 |
| | | KID ↓ | 0.024 | 0.032 | 0.017 | 0.007 | 0.014 |
| watercolor | male | CLIP (for person) ↑ | 0.933 | 0.899 | 0.900 | 0.829 | 0.936 |
| | | CLIP (for cloth) ↓ | 0.624 | 0.504 | 0.522 | 0.580 | 0.510 |
| | | FID ↓ | 109.170 | 165.194 | 183.500 | 237.846 | 115.074 |
| | | KID ↓ | 0.015 | 0.026 | 0.002 | 0.007 | 0.064 |
| | female | CLIP (for person) ↑ | 0.927 | 0.888 | 0.900 | 0.857 | 0.917 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.489 | 0.525 | 0.552 | 0.520 |
| | | FID ↓ | 165.093 | 188.962 | 192.864 | 222.935 | 179.176 |
| | | KID ↓ | 0.001 | 0.035 | 0.009 | 0.008 | 0.014 |
| | old person | CLIP (for person) ↑ | 0.912 | 0.933 | 0.821 | 0.843 | 0.927 |
| | | CLIP (for cloth) ↓ | 0.553 | 0.406 | 0.493 | 0.503 | 0.446 |
| | | FID ↓ | 208.356 | 144.734 | 251.704 | 217.168 | 144.041 |
| | | KID ↓ | 0.005 | 0.068 | 0.009 | 0.028 | 0.059 |
| | child | CLIP (for person) ↑ | 0.951 | 0.941 | 0.887 | 0.848 | 0.933 |
| | | CLIP (for cloth) ↓ | 0.570 | 0.497 | 0.541 | 0.558 | 0.505 |
| | | FID ↓ | 116.212 | 148.449 | 181.683 | 209.659 | 120.766 |
| | | KID ↓ | 0.001 | 0.029 | 0.027 | 0.050 | 0.003 |
| Module | Garment | Metric | With | Without |
|---|---|---|---|---|
| (a) IP-Adapter | upper | CLIP (for cloth) ↑ | 0.924 | 0.432 |
| | | FID ↓ | 100.454 | 120.545 |
| | | KID ↓ | 0.024 | 0.035 |
| | lower | CLIP (for cloth) ↑ | 0.952 | 0.632 |
| | | FID ↓ | 164.023 | 188.152 |
| | | KID ↓ | 0.002 | 0.012 |
| (b) GarmentNet | upper | CLIP (for cloth) ↑ | 0.911 | 0.824 |
| | | FID ↓ | 135.312 | 144.251 |
| | | KID ↓ | 0.025 | 0.046 |
| | lower | CLIP (for cloth) ↑ | 0.852 | 0.615 |
| | | FID ↓ | 140.125 | 200.254 |
| | | KID ↓ | 0.012 | 0.023 |