Article

Clothing-Agnostic Pre-Inpainting Virtual Try-On

1 Department of Artificial Intelligence and Software, Kangwon National University, Samcheok 25913, Republic of Korea
2 Department of Electronic and AI System Engineering, Kangwon National University, Samcheok 25913, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4710; https://doi.org/10.3390/electronics14234710
Submission received: 1 November 2025 / Revised: 24 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025

Abstract

With the development of deep learning technology, virtual try-on has gained important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa method addressed the texture distortion problem of diffusion-based models, but limitations remain: inaccurate bottom detection and the persistence of the existing clothing silhouette in the synthesis results. To solve these problems, this study proposes CaP-VTON (Clothing-Agnostic Pre-Inpainting Virtual Try-On). CaP-VTON integrates DressCode-based multi-category masking and Stable Diffusion-based skin inpainting preprocessing; in particular, a Generate Skin module was introduced to solve the skin restoration problems that occur when long-sleeved images are converted to short-sleeved or sleeveless ones. This preprocessing structure improves the naturalness and consistency of full-body clothing synthesis and enables high-quality restoration that takes human posture and skin color into account. As a result, CaP-VTON achieved a short-sleeve synthesis accuracy of 92.5%, 15.4 percentage points higher than Leffa, and consistently reproduced the style and shape of the reference clothing in visual evaluation. These structures maintain model-agnostic properties, are applicable to various diffusion-based virtual try-on systems, and can contribute to applications that require high-precision virtual try-on, such as e-commerce, custom styling, and avatar creation.

1. Introduction

With the rapid development of computer vision and deep learning technology, controllable person image generation models have found important application value in the fields of e-commerce, fashion, and entertainment. In particular, virtual try-on technology is contributing to improved purchase satisfaction and reduced return rates by allowing consumers to check the wearing effect in advance, without actually wearing clothes, in an online shopping environment. Against this background, the Learning Flow Fields in Attention (Leffa) method proposed by Zhou et al. [1] is drawing attention as an innovative approach for solving the detailed texture distortion problem that existing diffusion-based models were experiencing.
Existing controllable person image generation methods have achieved a high level of overall image quality but have shown limitations in accurately preserving fine texture details from reference images. These issues have been particularly noticeable in clothing containing complex patterns, logos, and text, and have posed significant barriers to implementing practical virtual try-on services. Leffa identified the root cause of this problem as the attention mechanism’s inability to focus on the correct area of the reference image and addressed it by introducing a regularization loss that induces the flow field learned within the attention layer to focus explicitly on the correct reference area.
However, some synthesis results could not be covered by the existing Leffa model, and we supplemented it by adding several elements from the pipelines of existing studies. The points that can be added to the existing Leffa model are shown in Figure 1. The first is shown in Figure 1a. Since the existing Leffa model is mainly optimized for virtual try-on centered on upper-body clothes, the detection and masking of the lower-body areas are relatively weak. This restricts virtual try-on scenarios centered on whole-body clothing or bottoms. In particular, when the lower body could not be detected in the input image, the whole body was re-synthesized even though only the upper-body garment was meant to be replaced. Therefore, a separate specialized preprocessing strategy is needed to address this. The second point is that the existing clothing characteristics included in the input source image had more than a certain level of influence on the final output result. Figure 1b shows output produced by Leffa with the corrective masking that addresses the first limitation. When trying to synthesize short-sleeved clothing from an image of a person wearing a long-sleeved top, as shown in Figure 1b, the existing long-sleeved shape was maintained regardless of the input garment. This appears to be because the skin areas intended to be exposed, such as the forearm, shoulder, and neck, were not sufficiently restored in the process of removing the clothing, and the existing clothing type acted as an interference factor in the subsequent clothing synthesis process. This phenomenon can lead to unnatural results when moving between items of clothing with large differences in shape, such as sleeve length and neckline.
To address these two limitations, CaP-VTON is proposed in this study. CaP-VTON introduces an improved masking mechanism based on DressCode while maintaining Leffa’s attention-based flow field learning structure, allowing it to precisely separate each clothing type into bottom and top and thereby improving the model’s ability to distinguish and detect the respective areas. It also integrates a skin-inpainting pipeline to reduce interference in the synthesis caused by remaining information about the existing clothing. In this way, we present an integrated solution that simultaneously satisfies the three goals of preserving detailed textures, synthesizing full-body clothing, and replacing clothes independently of the input image, while maintaining the advantages of the existing Leffa pipeline. In particular, it was confirmed that the interference caused by existing clothing characteristics was significantly reduced, along with the problems of morphological inconsistency, unnatural boundaries, and disharmony with the body proportions in lower-body synthesis. In addition, this study has high practical value in that it provides a general method of improvement that can be applied to various diffusion-based models by improving the accuracy of the masking process and the quality of skin restoration, while maintaining the model-agnostic characteristics of the existing Leffa model. This approach is differentiated from existing studies that present solutions dependent on specific models or datasets, and it is expected to be widely usable in various applications in the virtual try-on field.

2. Related Work

2.1. Evaluation of Generative Models

Generative models are more difficult to evaluate than classification models because there is no single ground-truth image against which a generated result can be compared. Evaluation must therefore rely on indices that follow the distribution of real domain images, and various such indicators have been proposed. The indicators used in this evaluation are as follows.
The FID (Fréchet Inception Distance) is a key indicator of the quality of images created by generative models [2]. The FID measures the statistical distance between two distributions by comparing the distributions of real and generated images: 2048-dimensional feature vectors are extracted with a pre-trained Inception v3 model, and the distance is computed from the mean (μ) and covariance (C) of the real and generated feature sets. The lower the FID value, the more similar the generated images are to the real images.
d^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\left( C_1 + C_2 - 2 (C_1 C_2)^{1/2} \right)
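A minimal sketch of this computation is shown below, assuming the 2048-dimensional Inception v3 features have already been extracted for both image sets; it is an illustration using NumPy/SciPy, not the evaluation code used in the paper.

```python
# Minimal FID sketch over precomputed Inception-v3 features (one row per image).
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(c1 @ c2, disp=False)   # matrix square root of C1*C2
    covmean = covmean.real                           # discard tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```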
The SSIM (Structural Similarity Index Measure) is an index that quantifies structural similarity between two images [3]. The SSIM outputs a value between 0 and 1 by comparing the luminance, contrast, and structural elements of two images; the closer the value is to 1, the more structurally similar the two images are. In particular, SSIM is more advantageous in evaluating the degree of retention of patterns and local structures than errors in pixel units. Generally, it is used to evaluate the visual quality of high-resolution images.
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
The LPIPS (Learned Perceptual Image Patch Similarity) is an indicator that quantifies perceptual differences between two images using deep learning-based image recognition models [4]. The LPIPS measures similarity by extracting feature vectors from different layers of pre-trained neural networks (e.g., AlexNet, VGG, SqueezeNet, etc.). The lower the LPIPS value, the more visually similar the two images are, and it has a complementary relationship with the FID and SSIM in that it better reflects high-dimensional semantic similarity than simple pixel differences.
\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left( \hat{f}^{l}_{x}(h, w) - \hat{f}^{l}_{y}(h, w) \right) \right\rVert_2^2
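For SSIM and LPIPS, a hedged usage sketch with scikit-image and the lpips package (AlexNet backbone) is shown below; the file names are placeholders and this is not the authors' evaluation script.

```python
# Hedged usage sketch: SSIM via scikit-image and LPIPS via the `lpips` package.
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import structural_similarity as ssim

real = np.array(Image.open("real.png").convert("RGB"))        # placeholder paths
fake = np.array(Image.open("generated.png").convert("RGB"))

# SSIM over RGB channels; closer to 1 means more structurally similar.
ssim_score = ssim(real, fake, channel_axis=-1, data_range=255)

# LPIPS with a pretrained AlexNet backbone; lower means more perceptually similar.
loss_fn = lpips.LPIPS(net="alex")
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
lpips_score = loss_fn(to_tensor(real), to_tensor(fake)).item()
```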
The indicators described in this chapter are used in Section 4.2.2 to compare the results of the proposed model with those of existing models. In addition, this study regards the form consistency of a virtual try-on model as a key criterion directly related to actual service quality, and a statistics-based evaluation index was introduced to measure it objectively. The normal output rate is a representative statistics-based quantitative indicator: the percentage (%) of cases in which a normal short-sleeved silhouette is accurately synthesized, specifically when an image of a person wearing long-sleeve clothing is input and the reference clothing is short-sleeved. The main purpose of this indicator is to directly quantify how well the reference clothing silhouette is reflected and how much interference is induced by the input clothing characteristics, which are difficult to capture with general image quality and similarity indicators (FID, SSIM, LPIPS, etc.). Existing models often fail to match the actual reference clothing type (short-sleeve, sleeveless, etc.) and leave the input clothing silhouette, especially long sleeves, untouched, whereas the proposed technique is fundamentally improved via pipeline strategies such as Generate Skin and improved masking. The normal output rate is calculated by taking the dataset cases in which the reference clothing is short-sleeved and the input is a long-sleeved image, classifying each output image according to whether the short-sleeved form is actually realized (normal/abnormal), and then computing the ratio of normal outputs.

2.2. Virtual Try-On Models

Virtual Try-On models can be divided into classical, diffusion-based, and attention-based models. Early Virtual Try-On approaches mainly used GAN-based warping techniques and segmentation. CP-VTON [5] handled clothing deformation using thin-plate-spline (TPS) warping and segmentation, but realistic image generation was limited, with a very high FID of 47.36. VITON-HD [6] then achieved an FID of 11.74 at 1024 × 768 resolution via the introduction of ALIgnment-Aware Segment (ALIAS) normalization and a warping module, but accurate clothing alignment and natural deformation in complex poses were still limited. ACGPN [7] introduced an adaptive content-generating and -preserving network and recorded an FID of 26.45, but its limited performance at 256 × 192 resolution made it difficult to extend to high-resolution use, and HR-VITON [8] achieved an FID of 10.91, but sleeve-squeezing and waist-squeezing artifact problems persisted. These classical models achieved an average FID of 24.06, with fundamental limitations in preserving complex clothing textures and generating realistic wearing effects.
With the advent of diffusion models in 2023, great progress was made in the field of Virtual Try-On. Diffusion models enabled more stable training and higher-quality image generation than GANs (Generative Adversarial Networks) [9] and achieved notable performance improvements, with an average FID of 7.18. StableVITON [10] achieved an FID of 6.52 by applying zero cross-attention to the LDM (Latent Diffusion Model), but segment detail loss arose due to attention misalignment. IDM-VTON [11] reached an improved FID of 6.29 by utilizing a garment-agnostic person representation and mask, but there were limitations regarding customization and consistency in complex clothing transformations. LaDI-VTON [12] recorded a relatively high FID value of 8.85 by combining Latent Diffusion, Textual Inversion, and CLIP. OOTDiffusion [13] implemented multi-outfit processing through a parallel outfitting UNet and outfitting fusion, and recorded an average FID score of about 10, with 11.03 for tops, 9.72 for bottoms, and 10.65 for dresses.
Later, the introduction of transformers and attention mechanisms enabled more sophisticated feature matching and spatial relationship modeling in Virtual Try-On. The multi-head self-attention proposed by Vaswani et al. [14] became the basis of the transformer model, and models applying it to Virtual Try-On have also begun to emerge. CatVTON [15] selectively fine-tuned only the self-attention module within the diffusion U-Net and conducted lightweight training with 49.57 million parameters, removing text encoders and cross-attention and maximizing efficiency by simply concatenating the person and clothing images in the spatial dimension; however, detailed semantic alignment control was limited. TryOnDiffusion [16] was based on a diffusion architecture, and precise matching between clothes and people was performed using a segment-aware person representation and cross-attention. It recorded an FID of 23.3 on the VITON-HD dataset and showed improved performance compared with existing diffusion models regarding detailed clothing texture reproduction and pose preservation. Human Diffusion [17] separated human image representations into separate latent spaces and proposed a diffusion pipeline that combines cross-attention and mask inpainting. It provides natural results for whole-body synthesis and multi-clothing-category replacement and achieved FID scores in the range of 30.4–31.2 on the DeepFashion-Multimodal dataset.
Leffa [1] improved the self-attention and cross-attention map by applying flow field regularization to the attention of the diffusion model. This enabled more sophisticated attention control than the existing mask inpainting method, zero cross-attention method, and prompt-aware mask method, but there were still problems with the complexity and fine-detail distortion of flow field regularization.
The existing Leffa applied only flow field regularization to the attention mechanism to enable more sophisticated attention control than previous methods, but it still showed limitations in complex clothing deformation, detailed texture preservation, and accurate area segmentation in the pre-processing stage. This situation is illustrated in Figure 1; several steps were added to the model’s pipeline to overcome these limitations.
In this study, we propose an inpainted, pre-processed-input-based attention flow system to address the limitations of the existing Leffa model. We propose the CaP-VTON (Clothing-Agnostic Pre-Inpainting Virtual Try-On) model, which adds components that compensate for these shortcomings to the VITON-HD variant of the Leffa model. Through this model, the stability of the diffusion model, the attention mechanism of the transformer, and the advantages of the classical warping technique are selectively integrated to address the limitations seen in existing studies, such as weak attention control, inaccuracy in the pre-processing step, and a lack of consistency between various clothing types.

3. Proposed Method

3.1. System Flow: Concept of CaP-VTON

This study proposes the CaP-VTON (Clothing-Agnostic Pre-Inpainting Virtual Try-On) model, a new model that integrates two key improvement factors to complement the structural limitations of the existing Leffa pipeline and enable a natural and coherent transition between various clothing categories. The proposed improvement factors are the DressCode-based multi-category masking stage, which covers full-body clothing such as top, bottom, and dress, and the Stable Diffusion-based skin inpainting pre-processing stage for removing existing clothing silhouettes and restoring exposed skin.
The overall flow is shown in Figure 2. This study can be largely divided into two stages. The first is the Modified Leffa Pipeline, which addresses the limitation whereby the detection and masking of the lower-body clothing area are relatively weak because the existing VITON-HD Leffa model is mainly optimized for virtual try-on centered on upper-body clothes. The second is the Stable Diffusion Generate Skin Inpainting step, which addresses the limitation whereby the existing clothing characteristics included in the input source image affect the final output beyond a certain level. In the first step, when the user inputs a source image, the semantic parts of the human body are first segmented using the SCHP (Self-Correction for Human Parsing) model [18], and at the same time, the various clothing areas such as top, bottom, and dress are accurately masked through the DressCode model. In addition, the person’s posture information is extracted using OpenPose [19] so that the human structure is not distorted during the subsequent inpainting process. The body mask and pose information obtained in this step are used in the inpainting process to remove the existing clothing area and naturally restore the skin and a short-sleeved silhouette, producing a pre-processed inpainted image (Figure 2B) from the input source image (Figure 2A). In the second step, a mask of the pre-processed image, which is now free of the existing long-sleeved clothing silhouette, is generated through the DressCode model, and this is passed to diffusion inpainting together with the reference clothing information by applying the attention-based flow field regularization of the VITON-HD model to generate the final virtual try-on image (Figure 2C).
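A minimal sketch of this two-stage flow is shown below. Every function name (schp_parse, dresscode_mask, openpose_keypoints, generate_skin_inpaint, leffa_viton_hd) is a hypothetical placeholder for the corresponding component described above, not an actual API of the released implementations.

```python
# Hypothetical high-level sketch of the CaP-VTON two-stage flow (placeholder names).
from PIL import Image

def cap_vton(person_path: str, garment_path: str) -> Image.Image:
    person = Image.open(person_path)          # source image (Figure 2A)
    garment = Image.open(garment_path)        # reference clothing

    # Stage 1: Generate Skin inpainting (remove the existing clothing silhouette).
    parsing = schp_parse(person)                      # SCHP semantic body parts
    cloth_mask = dresscode_mask(person, parsing)      # DressCode multi-category mask
    pose = openpose_keypoints(person)                 # pose map for ControlNet conditioning
    preprocessed = generate_skin_inpaint(person, cloth_mask, pose)   # Figure 2B

    # Stage 2: Leffa (VITON-HD) diffusion inpainting on the cleaned image.
    inpaint_mask = dresscode_mask(preprocessed, schp_parse(preprocessed))
    return leffa_viton_hd(preprocessed, garment, inpaint_mask)       # Figure 2C
```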

3.2. Stable Diffusion Generate Skin Inpainting: Removing Existing Clothing Silhouette

When trying to synthesize short-sleeved clothing from an image of a person wearing a long-sleeved top using the existing model, the existing long-sleeved shape is maintained regardless of the type of reference clothing, as shown in Figure 3a,c, and the silhouette of the existing clothing remains, as shown in Figure 3b. This is because the areas of exposed skin, such as the forearms, shoulders, and neck, are not properly restored during the clothing removal process, so the silhouette of the existing clothing interferes with the synthesis of the new clothing.
To solve this problem, the Generate Skin pipeline inserts a short-sleeved silhouette and realistic skin into the area where the existing clothing has been removed, so that the removed clothing no longer affects the output. The aim is to accurately reflect the unique shape and style of the reference clothing regardless of the clothing originally worn in the input image.
The number of data points used in the garment silhouette removal experiment of the Generate Skin inpainting module was set to 599 to secure both reproducibility and experimental efficiency in a limited GPU resource environment (Colab-based). “Wearing Held Tight Short Sleeve Shirt, high quality skin, realistic, high quality” was used as the prompt, and “Blurry, low quality, artifacts, deformed, ugly, texture, watermark, text, bad anatomy, extra limbs, face, hands, fingers” was used as the negative prompt to preclude the creation of the corresponding elements. The number of inference steps was set to 20, and no separate fine-tuning was performed.
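A hedged sketch of how these settings map onto a Diffusers inpainting call is shown below; the pipeline class, checkpoint ID, and file names are illustrative assumptions rather than the authors' exact code, and the OpenPose ControlNet conditioning described in the next subsection is omitted here.

```python
# Illustrative sketch only: checkpoint ID and file names are placeholders, and
# the ControlNet conditioning added in Section 3.2.1 is not shown.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

person = Image.open("person.png")           # source person image (placeholder path)
cloth_mask = Image.open("cloth_mask.png")   # white = clothing region to repaint

result = pipe(
    prompt="Wearing Held Tight Short Sleeve Shirt, high quality skin, realistic, high quality",
    negative_prompt=("Blurry, low quality, artifacts, deformed, ugly, texture, "
                     "watermark, text, bad anatomy, extra limbs, face, hands, fingers"),
    image=person,
    mask_image=cloth_mask,
    num_inference_steps=20,   # 20 steps, no fine-tuning, as stated above
).images[0]
result.save("generate_skin_output.png")
```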

3.2.1. Generating Input Information to Build Synthetic Model

To perform inpainting through the Generate Skin function, an inpainting mask image and an OpenPose image containing the pose information of the input image are required. The Generate Skin inpainting mask output in Figure 4 was generated using the masking pipeline provided by Leffa’s DressCode model. In addition, the OpenPose output in Figure 5 was used as conditioning information for the ControlNet in the Stable Diffusion-based inpainting process. As shown in Figure 5, when the ControlNet is not used, the pose identity of the original person is not accurately transmitted to the network, resulting in structurally distorted images, for example, misaligned body parts such as the arm or shoulder, or blending with the background. We therefore determined that preserving the pose and securing the accuracy of the skin area during inpainting directly affect the realism of the final result, and to address this, the Generate Skin inpainting mask and an OpenPose-based ControlNet [20] were used as the input information for the Generate Skin function. In addition, as can be seen in Figure 5, there were cases in which the color of the inpainted skin did not properly reflect the skin tone of the original image, producing darker or brighter skin than was actually present. To compensate for this, a skin tone extraction technique was introduced to create a skin texture that harmonizes naturally with the original image. This technique is implemented by detecting the skin area in the YCrCb color space and calculating the average skin tone value in the HSV color space. The skin area created through this approach blends naturally with the skin tone of the original image, so that the overall visual consistency of the resulting image is maintained.
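A minimal OpenCV sketch of this skin-tone extraction step is shown below; the YCrCb thresholds are common heuristic values and are an assumption, not values reported in the paper.

```python
# Minimal sketch: detect skin pixels in YCrCb, then average their tone in HSV.
import cv2
import numpy as np

def mean_skin_tone_hsv(bgr_image: np.ndarray):
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    skin_mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))  # heuristic skin range
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h, s, v, _ = cv2.mean(hsv, mask=skin_mask)                     # mean over skin pixels only
    return h, s, v

# The average (H, S, V) can then be used to shift the inpainted skin region
# toward the person's original tone before blending.
```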

3.2.2. MajicMix Realistic Model-Based High Quality Skin Synthesis

The core of the Generate Skin pipeline is inpainting the skin area using the MajicMix Realistic model [21]. This model is based on Stable Diffusion 1.5, and is a high-quality checkpoint model trained by fusing various model weights. In particular, it is optimized for generating Asian skin tones, so it shows strength in deriving natural results from Asian images. In addition, this model introduces the Noise Offset Technique, which enables more sophisticated light source expression and smooth shadowing during the diffusion process. It exhibits excellent performance in restoring the texture, shade, and color of complex skin areas, such as arms, shoulders, and neck, that are exposed after clothing removal. Figure 6 shows the results output by the Generate Skin pipeline.

3.3. Modified Leffa Pipeline: Multi-Category Enabled Based on DressCode

Two models together constitute the Leffa framework: one trained with VITON-HD data and one trained with DressCode data. The VITON-HD Leffa model was optimized for upper-body-oriented synthesis. For this reason, there were limitations in the detection of the lower-body area and in whole-body synthesis, and images with a high proportion of lower-body area showed unstable synthesis results. To solve this problem, as shown in Figure 7, the inpainting mask was generated with the DressCode model, while the VITON-HD model, which produces higher output image quality, was used for the Leffa diffusion inpainting output. A masking strategy was thus constructed using the multi-category structure of the DressCode dataset.
Figure 8 compares the masking quality of the two models. The VITON-HD model was trained on the VITON-HD dataset, which specializes in upper-body clothing images, and is mainly optimized for image synthesis. Accordingly, images centered on the upper body show excellent visual quality, but information related to the lower body is not included, so the detection and masking of the lower-body area are limited. In contrast, the DressCode model was trained on a dataset covering various categories of clothing, such as tops, bottoms, and dresses, which allows masking of full-body clothing. In fact, in this study, when the DressCode model was used to generate the inpainting mask, stable masking results were obtained in multi-category situations.
To compensate for these structural differences, this study designed a pipeline that combines the advantages of the two models. In the inpainting masking process, multi-category detection was performed using the DressCode model, and better image quality was secured by using the VITON-HD model in the final diffusion synthesis step. To verify this numerically, the synthesis performance of the two models was compared, as shown in Table 1. The FID score was 4.54 for the VITON-HD model, lower than that of the DressCode model (5.05), showing an advantage in the overall visual quality of the generated image. On the other hand, the DressCode model is better on the SSIM and LPIPS indicators, showing that it is superior in the structural and detailed texture expression of the actual clothes.

4. Experimental Results

4.1. Experimental Environment

This study was conducted in the Google Colab environment, with Ubuntu 20.04 LTS as the operating system. Python 3.11.13 was used as the programming language, and GCC 11.4.0 as the compiler. The main libraries were image processing and deep learning packages such as OpenCV and TorchGeometry, centered on PyTorch (1.6.0 or higher, with GPU acceleration support).
As a model for image generation, the ChilloutMix checkpoint based on Stable Diffusion 3.5 was used. This is a derivative weight (checkpoint) that improves human texture and lighting expression while maintaining the structure of Stable Diffusion 3.5, and is driven by the Diffusers library. The main settings in the generation process included cfg scale = 7, steps = 25, sampler = Euler a, and clip skip = 2. Person images generated through this setting were used for matching and synthesis experiments with clothing images within the CaP-VTON pipeline.
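A hedged sketch of how these sampler settings map onto the Diffusers API is shown below. The checkpoint path and prompt are placeholders, and the pipeline class shown assumes an SD 1.x-style checkpoint interface; loading a Stable Diffusion 3.5-based checkpoint would require the corresponding SD3 pipeline class.

```python
# Illustrative mapping of the stated settings (cfg scale = 7, steps = 25,
# sampler = Euler a, clip skip = 2) onto Diffusers; checkpoint path and prompt
# are placeholders, not the authors' exact configuration.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/chilloutmix-checkpoint", torch_dtype=torch.float16  # placeholder path
).to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)  # "Euler a"

person_image = pipe(
    prompt="full-body photo of a person standing, plain background",  # illustrative prompt
    guidance_scale=7.0,       # cfg scale = 7
    num_inference_steps=25,   # steps = 25
    clip_skip=2,              # clip skip = 2
).images[0]
person_image.save("generated_person.png")
```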
The hardware configuration consisted of an NVIDIA A100 GPU (40 GB memory), 83.5 GB of system RAM, and an Intel Xeon-based CPU, and the experiment was conducted through the Jupyter Notebook-based Colab interface (https://jupyter.org/install, accessed on 25 November 2025). With the Stable Diffusion model configured as above, generating a single image took about 8 s on average.
Two types of person images were constructed to verify the performance of the model on new data and to evaluate whether the model maintains stable results on real-person data. A total of 1100 person images was constructed by combining 100 AI-generated person images created with the Stable Diffusion 3.5-based ChilloutMix model and 1000 real person images from the VITON-HD test dataset, together with 1100 corresponding clothing images of the same size. All images were normalized to the base resolution of 1024 × 768 to secure reproducibility and efficiency under limited GPU resources (Colab-based), so the experimental dataset finally consisted of 1100 clothing images and 1100 person images.

4.2. Results

4.2.1. Visual Comparison of the Synthesis Results

Figure 9 shows a visual comparison of the synthesis results generated by various virtual wearing models based on the same input image and the same reference clothing. This comparison was evaluated beyond simple visual quality, focusing on whether the silhouette of the reference clothing (especially the sleeve length and arm exposure area) was accurately reproduced.
The comparison shows that the existing CatVTON, IDM-VTON, Leffa–VITON-HD, and Leffa–DressCode models were all influenced by the long-sleeved silhouette of the input image, showing limitations such as abnormally long sleeves in the output image or insufficient exposure of the arms and shoulders, and thus failing to accurately reproduce the design of the reference clothing. In particular, the Leffa–VITON-HD model synthesized the lower body, which was not intended to be synthesized, and even though short-sleeved clothing was applied, the shape of the existing long sleeves remained in the output or the human body was unnaturally distorted.
In contrast, the proposed CaP-VTON model effectively removed the existing long-sleeved clothing and naturally and consistently restored the exposed skin areas, such as the arms, shoulders, and armpits, by utilizing the Generate Skin-based preprocessing technique. As a result, the short-sleeved silhouette of the reference clothing was clearly reproduced in the output image, and its pattern and shape were cleanly reflected without interference from the existing clothing. This difference concerns not only visual naturalness but is also an important indicator of how accurately the style of the reference clothing can be reflected. While existing models are strongly influenced by the input clothing, CaP-VTON shows that a form independent of the input clothing can be reproduced through structural-consistency-based preprocessing. This is a critically important factor in a whole-body virtual try-on system intended for practical clothing replacement and is a key point distinguishing this study from existing methodologies.
In conclusion, CaP-VTON is the model that most accurately reflects the style and silhouette of reference clothing; we have experimentally demonstrated that it provides the most reliable synthetic quality from a user experience perspective beyond simple image quality or structural similarity indicators.

4.2.2. Quantitative Comparison of Image Synthesis Quality

Table 2 compares the performance of the existing virtual try-on models and the proposed CaP-VTON model using the FID, SSIM, and LPIPS indicators on the upper-body clothing category. The proposed CaP-VTON model achieved slightly lower quantitative performance, with an FID of 11.46, SSIM of 0.8573, and LPIPS of 0.0849, due to the addition of the image inference pipeline (Generate Skin method).
When compared numerically, our model shows moderate quality among the synthesis models discussed in this paper. This means that, although the synthesis result is not the best by these metrics, it does not introduce artifacts severe enough to be perceived by humans. These figures are general indicators focused on overall image quality and perception-based similarity; in other words, they are limited in evaluating the accuracy of the long-sleeve-to-short-sleeve transition, which is the purpose of this paper. Therefore, an observation-based accuracy evaluation was added.

4.2.3. Ablation Study

Figure 10 shows the result of an ablation study analyzing the effect on the resulting quality of each major element comprising the CaP-VTON model. The same input image and reference clothing were applied in each configuration, and the short-sleeved silhouette and the degree of arm/shoulder skin exposure were compared in the output image. As a result, when using only Leffa, most of the existing long-sleeved clothing silhouettes remained present in the output image, and the exposure of the arms and shoulders was incomplete. In the case of Leffa + DressCode, the patterns and silhouettes of some reference clothing were reflected, but complete transformation was not made due to the influence of the existing long-sleeved structure. In the case of Leffa + Generate Skin, a significant part of the silhouette of the existing long-sleeved clothing remained, and the short-sleeved silhouette was still incompletely reproduced. Finally, the CaP-VTON model of Leffa + DressCode + Generate Skin integrated all the components required to minimize the influence of existing long-sleeved clothing, such that the short-sleeved silhouette and pattern of the reference clothing were naturally reflected.
Table 3 compares the performance of the existing virtual try-on models and the proposed CaP-VTON model using the FID, KID, SSIM, and LPIPS indicators on the upper-body clothing category. Since Leffa does not significantly modify the original person texture, it recorded the lowest FID (4.54) and showed an edge in overall distribution-based visual quality. In contrast, the proposed CaP-VTON includes a Generate Skin step that removes the existing silhouette around the sleeves and arms and reconstructs new skin areas. Since this process changes the texture and color distribution of the person globally, the FID and LPIPS, which measure the difference from the overall statistical distribution, showed relatively high values (FID 11.46, LPIPS 0.0849). The degradation of CaP-VTON on these indicators can be interpreted as a structural trade-off of a design that actively modifies the global image to improve sleeve-shape consistency, rather than as a defect in the model (this is discussed in detail in Section 5.4). Moreover, these figures are general indicators centered on overall image quality and perception-based similarity, and they are limited in evaluating the accuracy of long-sleeve-to-short-sleeve conversion, which is the purpose and core task of this study. Accordingly, direct observation-based evaluation and quantitative comparison were conducted in parallel.

4.2.4. Evaluation Based on Short-Sleeve Synthesis Probability

Leffa + DressCode and CaP-VTON were selected to quantitatively compare the short-sleeved clothing synthesis results of the CaP-VTON model. The reasons are as follows. First, as pointed out at the beginning of this paper, Leffa alone is limited in bottom detection and, without the modified masking, lacks connectivity to the different clothing types, so it is not suitable as a comparison target for evaluating the improvement induced by the proposed model. In other words, due to these structural limitations, short-sleeved clothing conversion itself is not performed smoothly by this model, making a comparison under the same conditions impossible. Second, Leffa + Generate Skin is a configuration in which only the skin restoration module is applied without the DressCode data, and it is insufficient for evaluating clothing silhouette conversion in the virtual try-on process.
Figure 11 compares the output of the Leffa model trained with the DressCode dataset and the CaP-VTON model in order to quantitatively compare the short-sleeved clothing synthesis results of the proposed CaP-VTON model. In addition to the existing example images, real person images from the DeepFashion dataset were added to the image comparison to verify the quality of synthesis in more realistic situations.
The DeepFashion dataset is intended for clothing recognition and attribute classification, and its purpose and characteristics differ from those of VITON-HD or DressCode, which are configured for virtual try-on. Therefore, applying DeepFashion data to a model that has already been sufficiently trained on VITON-HD or DressCode data, such as the existing Leffa, tests the model’s performance on new data and serves as an objective indicator of whether it can maintain stable performance on real-person data.
As can be seen from the figure, the silhouette consistency between short-sleeved and sleeveless clothing is more natural than that achieved by Leffa, and the boundary restoration of the skin-exposed areas is smooth. In particular, when converting a long-sleeved input image to an output with short-sleeved clothing, residual sleeves, body parts, or discontinuous textures remained on the arm in the existing Leffa, whereas in the proposed CaP-VTON model, the silhouette of the existing clothes was completely removed through the combination of Generate Skin preprocessing and multi-category masking, and the arm shape and skin tone were naturally restored.
In this study, a CLIP-based clothing shape consistency evaluation was performed to automatically determine whether the reference short-sleeved clothing was actually synthesized into a normal short-sleeve image when a long-sleeved input image was given. This evaluation was implemented by comparing the semantic similarity between images and text using OpenAI’s CLIP (ViT-B/32) model. Specifically, reference vectors were generated by embedding two text prompts, “A person wearing a short sleeve” and “A person wearing a long sleeve”. Then, for each composite image, the visual feature vector was extracted through the image encoder of CLIP, and the cosine similarity to each of the two text vectors was calculated. If the composite image showed greater similarity to the “short sleeve” prompt, it was classified as a normal output; conversely, if it showed greater similarity to the “long sleeve” prompt, it was classified as an abnormal output. In addition, if the difference between the two similarities (score_diff = correct_score − wrong_score) was 0.02 or more, it was classified as normal with clear consistency, and if it was positive but less than 0.02, it was classified as partially correct.
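A hedged sketch of this per-image classification using the openai/CLIP package is shown below; "output.png" is a placeholder for one synthesized image, and this is not the authors' published scoring code.

```python
# Hedged sketch of the per-image CLIP (ViT-B/32) sleeve-consistency check.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["A person wearing a short sleeve", "A person wearing a long sleeve"]
text_tokens = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("output.png")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    correct_score, wrong_score = (img_feat @ txt_feat.T).squeeze(0).tolist()

score_diff = correct_score - wrong_score      # short-sleeve minus long-sleeve similarity
if score_diff >= 0.02:
    label = "correct"              # clearly reproduces the short-sleeve silhouette
elif score_diff > 0:
    label = "partially correct"    # short-sleeve wins, but with a weak margin
else:
    label = "incorrect"            # closer to the long-sleeve prompt
print(label, round(score_diff, 4))
```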
Figure 12 visualizes the comparison of Correct, Partially Correct, and Incorrect output ratios for four models: CaP-VTON, Leffa only, Leffa + DressCode, and Leffa + GenerateSkin. Of the 1100 test images, 801 (72.8%) of CaP-VTON’s outputs were judged as correct, which is higher than Leffa only (65.27%), Leffa + DressCode (68.18%), and Leffa + GenerateSkin (63.63%). The Partially Correct ratio of CaP-VTON was also stable at 22.0%, and its Incorrect ratio was the lowest at 5.18%, demonstrating higher shape-consistency stability than all the other models.
These results imply that the combination of Generate Skin-based preprocessing and multi-category masking effectively eliminates the long-sleeved silhouette residue, the lack of arm/shoulder exposure, and the detail boundary distortion that occurred repeatedly in existing models. In particular, while Leffa-family models are strongly dependent on the original person silhouette, CaP-VTON reproduces the silhouette of the reference garment more accurately, as it structurally removes the existing long-sleeved texture and then restores the short-sleeved shape.
Meanwhile, Leffa showed relatively good figures on FID- and SSIM-based global quality indicators, but these global indicators alone cannot explain the accuracy of the actual clothing type replacement. The most important factor from a user’s point of view is not how similar the image is to the original, but whether the morphological characteristics of the reference clothing are accurately reflected. The CLIP-based shape consistency evaluation is significant in that it directly quantifies this practical performance, and it is an index that faithfully reflects the purpose and evaluation goals of this study.
In conclusion, CaP-VTON recorded the highest correct and lowest incorrect output rates among the four models, showing a clear performance improvement in terms of structural consistency. This confirms the practical validity of the pipeline, which is designed to minimize reliance on the original clothing form and to accurately reflect the silhouette and style of the reference clothing, and demonstrates its applicability to full-body virtual fitting and high-level style-switching applications.

5. Discussion

This chapter describes the contributions of our paper and the limitations of our research. Section 5.1, Section 5.2 and Section 5.3 describe our contributions, Section 5.4 describes the limitations of the FID indicator, and Section 5.5 describes the limitations of our research.

5.1. Improved Entire Pipeline

The improved masking pipeline and Generate Skin-based skin inpainting pipelines do not just complement each function but also create complementary synergies at the entire pipeline level. In particular, the integration of these two modules effectively alleviated the structural limitations of the existing Leffa model, and produced stable and natural results, especially when long/short-sleeve conversion or partial skin exposure correction was required.
The masking pipeline utilizes multi-category information from the DressCode dataset to perform accurate clothing boundary detection and, based on the results, generates an appropriate inpainting mask to remove existing clothing areas. Subsequently, the generate skin pipeline performs a function of blocking existing clothing characteristics included in the input image from interfering with the final output by realistically restoring the skin and short-sleeved silhouette to the removed area. This dual processing structure provides an ideal condition for accurately and consistently reflecting the shape of the reference clothing, regardless of the shape of the existing clothing.
For example, in a scenario wherein a person wearing a long-sleeve shirt wears a sleeveless dress, the masking pipeline accurately detects the long-sleeve area, from the shoulder to the wrist, and the Skin Inpainting Pipeline naturally restores the skin therein, thereby accurately reproducing the exposure design of the reference dress. This is not just a partial modification but a structural improvement of the entire process, leading to pre-masking–inpainting–diffusion for whole-body synthesis.

5.2. Support for Replacement Across Different Clothing Types

This integrated pipeline supports replacement between heterogeneous categories such as top/bottom and top/dress by utilizing the multi-category structure of the DressCode dataset. As a result, the constraints of the top central structure of the existing Leffa model are alleviated, and relatively natural synthesis of complex clothing structures (dress, jumpsuit, etc.) is possible. In particular, this function still operates most stably when the shape of the reference clothing is clearly distinguished from the body silhouette.

5.3. Clothing-Agnostic Properties

Existing virtual try-on models have limitations, such as not being able to completely remove the silhouette of the clothes in the input image, or unintentionally synthesizing into the bottom area. For example, when short-sleeved clothes are synthesized on a person wearing a long-sleeved shirt, the arm outline of the original image faintly remains, resulting in unnatural overlapping in the output image.
In this study, to address these limitations, the existing clothing silhouette was removed using the DressCode masking pipeline and the Generate Skin-based skin inpainting pipeline, and natural arm and skin areas suited to the body structure of the original person were restored. Through this, a clothing-agnostic model was implemented that can synthesize new reference clothing with the clothing silhouette of the original image completely removed, regardless of what type of clothing the input image features. This approach also provides more realistic and natural results when synthesizing skin-exposing clothing such as short-sleeved and sleeveless garments. The clothing-agnostic design of this study therefore serves as an important basis for enabling stable clothing conversion even under various exposure forms or pose deformations in the future.

5.4. Limitations of FID and Interpretation in Virtual Try-On

CaP-VTON has structural properties that completely remove the existing sleeve silhouette from the image of a person wearing long sleeves and generate wide skin areas such as the arms, shoulders, and armpits. This intensive structural transformation fundamentally changes the texture and color distribution of the original image; as a result, it appears relatively unfavorable on the FID index, which evaluates whether the generated image maintains the same statistical distribution as the original. In other words, the increase in CaP-VTON’s FID is not a model error or performance degradation but a normal consequence of a design that intentionally reconstructs the original texture for long-sleeve-to-short-sleeve transformation. These characteristics correspond exactly to the structural reason why Leffa, which almost fully preserves the original texture, records low FID values.
However, in real-world virtual try-on applications, how accurately the shape of the reference garment is reproduced is a far more important factor than statistical distribution similarity with the original image. In the CLIP-based short-sleeve consistency evaluation (clip_sleeve evaluation) performed to confirm this, CaP-VTON recorded a normal output rate about 7.53 percentage points higher than that of Leffa, showing a notable advantage in its ability to reproduce the actual short-sleeved and sleeveless silhouette. This is a key indicator that CaP-VTON is superior in terms of practical functionality as a virtual try-on model beyond simple image quality.
In addition, CaP-VTON maintained a value almost similar to that of Leffa in terms of SSIM. This means that even though the FID has risen due to the difference in the original distribution, the image structural consistency and visual clarity remain at the same level as the existing model. Therefore, FID is difficult to see as a single indicator suitable for judging the performance of this model because the structural purpose (shape replacement) and evaluation criteria (distribution maintenance) of CaP-VTON fundamentally conflict.
Finally, CaP-VTON cannot be evaluated with a single FID indicator, but it clearly demonstrates practical usefulness and model improvement in real-world virtual fitting situations by showing clear performance improvements in purpose-oriented indicators such as CLIP-based clothing shape consistency, normal output ratio, and SSIM.

5.5. Limitations of Our Model

The CaP-VTON model proposed in this study effectively supplemented the core limitations of the existing Leffa model through an improved masking pipeline and a skin-inpainting-based preprocessing structure, but it still has the following two major constraints.
The proposed structure is not equally effective for all types of clothing transformation. For complex pattern changes, the detailed reproduction of material textures, or full-body synthesis including the entire body, conventional VITON-HD or StableVITON may show better quality. Therefore, CaP-VTON is best suited for upper-body-focused virtual try-on that requires skin restoration, and for texture-oriented styles it is preferable to use it as an auxiliary tool alongside texture-based transfer methods.
In addition, the proposed structure supports switching between heterogeneous categories but still operates most stably when the shape of the reference clothing is clearly distinguished from the body silhouette. Discontinuous results can occur with translucent or complex materials. Future studies should improve the estimation of the body-clothing boundary in these cases.

6. Conclusions

This study has proposed a new pipeline structure, CaP-VTON (Clothing-Agnostic Pre-Inpainting Virtual Try-On), to overcome the limitations of the existing Leffa-based virtual try-on system. While Leffa showed excellent performance in detailed texture representation through attention-based flow field learning, it had structural limitations such as bottom detection inaccuracy and the persistence of existing clothing silhouettes. To solve these two key problems, this study designed a dual complementary pipeline that integrates (1) a multi-category masking pipeline based on DressCode and (2) a Generate Skin preprocessing mechanism based on Stable Diffusion.
The masking pipeline was improved to accurately detect various clothing categories such as top, bottom, and dress, and the skin inpainting pipeline was configured to remove the existing clothing area and naturally restore the skin and a short-sleeved silhouette so that the style of the reference clothing is accurately reflected. In particular, the Generate Skin method contributes to a stable transformation from long-sleeved to short-sleeved or sleeveless clothing by naturally restoring skin areas such as the arms, shoulders, and armpits according to the body’s posture after the clothing is removed. The experiments showed that CaP-VTON maintained an SSIM-based structural quality comparable to that of the existing Leffa and, at the same time, achieved an accuracy of 92.5% in the consistency evaluation based on short-sleeved silhouette synthesis, 15.4 percentage points higher than Leffa’s 77.1%. In addition, it was confirmed that CaP-VTON consistently reproduces the sleeve length, pattern, and style of the reference clothing, unlike the other existing models.
This study makes the following contributions to virtual try-on modeling. First, bottom and dress masking functions were successfully implemented to support whole-body synthesis. Second, an input-independent synthesis structure was established by removing the interference of the existing clothing. Finally, the model-agnostic modular structure secured a versatility that makes it applicable to various diffusion-based models. This model thus represents a foundational technology for building a more precise and practical virtual try-on system in real e-commerce environments and has the potential to expand into various fields such as the simulation of customized clothes, avatar generation, and clothing design previews. Future research is expected to further expand the practicality and applicability of the CaP-VTON pipeline by refining the handling of complex poses, backgrounds, and interactions with non-clothing elements such as accessories and hair; improving inference speed in real-time environments; and integrating a dynamic clothing recommendation system based on user feedback.

Author Contributions

S.K., H.J.L., J.L., and T.L. participated in all phases and contributed equally to this work. Their contributions to this paper are as follows. S.K. contributed to the methodology, software, and formal analysis. H.J.L. contributed to the investigation, resources, and data curation. J.L. contributed to the original draft preparation and visualization. T.L. contributed to the review and editing of the writing, supervision, and project administration. Furthermore, all authors contributed to the conceptualization and validation of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a 2022 Research Grant from Kangwon National University and by the Technology Innovation Program (RS-2024-00507228, Development of process upgrade technology for AI self-manufacturing in the cement industry) funded by the Ministry of Trade, Industry & Energy (MOTIE, Republic of Korea).

Data Availability Statement

The two types of Leffa dataset [1] used in this study are publicly available and can be accessed at https://github.com/shadow2496/VITON-HD and https://github.com/aimagelab/dress-code (accessed on 20 September 2025).

Acknowledgments

This study was supported by 2022 Research Grant from Kangwon National University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Z.; Liu, S.; Han, X.; Liu, H.; Ng, K.W.; Xie, T.; Cong, Y.; Li, H.; Xu, M.; Pérez-Rúa, J.M.; et al. Learning Flow Fields in Attention for Controllable Person Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 2491–2501. [Google Scholar] [CrossRef]
  2. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6626–6637. [Google Scholar] [CrossRef]
  3. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
  5. Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L. Toward Characteristic-Preserving Image-Based Virtual Try-On Network (CP-VTON). In Computer Vision—ECCV 2018. Part XIII; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 607–623. [Google Scholar] [CrossRef]
  6. Choi, S.; Park, S.; Lee, M.; Choo, J. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14131–14140. [Google Scholar] [CrossRef]
  7. Yang, H.; Min, S.; Chen, X.; Wang, Y.; Lin, L. Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content (ACGPN). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11854–11863. [Google Scholar] [CrossRef]
  8. Lee, J.; Kim, S.; Kim, D.; Sohn, K. HR-VITON: High-Resolution Virtual Try-On via Joint Layout and Texture Learning. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 280–296. [Google Scholar] [CrossRef]
  9. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
  10. Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 8176–8185. [Google Scholar] [CrossRef]
  11. Choi, Y.; Kwak, S.; Lee, K.; Choi, H.; Shin, J. IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-On in the Wild. arXiv 2024, arXiv:2403.05139. [Google Scholar] [CrossRef]
  12. Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM International Conference on Multimedia (ACM MM), Ottawa, ON, Canada, 27 October–3 November 2023. [Google Scholar] [CrossRef]
  13. Xu, Y.; Gu, T.; Chen, W.; Chen, C. OOTDiffusion: Outfitting Fusion Based Latent Diffusion for Controllable Virtual Try-On. arXiv 2024, arXiv:2403.01779. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar] [CrossRef]
  15. Chong, Z.; Dong, X.; Li, H.; Zhang, S.; Zhang, W.; Zhang, X.; Zhao, H.; Jiang, D.; Liang, X. CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar] [CrossRef]
  16. Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 4606–4615. Available online: https://arxiv.org/abs/2306.08276 (accessed on 10 July 2025).
  17. Zhang, K.; Sun, M.; Sun, J.; Zhao, B.; Zhang, K.; Sun, Z.; Tan, T. HumanDiffusion: A Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation. 2023. Available online: https://arxiv.org/abs/2211.06235 (accessed on 10 July 2025).
  18. Li, P.; Song, G.; Zhang, Y.; Tong, Z.; Wei, X.; Liu, Y.; Yang, X. Self-Correction for Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4213–4225. [Google Scholar] [CrossRef] [PubMed]
  19. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar] [CrossRef]
  20. Zhang, L.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3813–3824. [Google Scholar] [CrossRef]
  21. Merjic. “majicMIX Realistic v7: Stable Diffusion Checkpoint Merge”. HuggingFace, 2023. Available online: https://huggingface.co/imagepipeline/MajicMIX-realistic (accessed on 25 November 2025).
Figure 1. Limitations of the previous Leffa model: (a) inaccurate bottom (lower-garment) detection and (b) synthesis limitations dependent on the source input image.
Figure 2. Proposed method: CaP-VTON. Blue arrows indicate the main data flow, in which the input person and clothing images pass from the input image (A) through the pre-processed image (B) to the final output image (C) via the two-stage inpainting and the Leffa pipeline. Green arrows indicate the path by which the human parsing result obtained from Self-Correction for Human Parsing (SCHP) is shared across the two-stage inpainting mask generation, and orange arrows indicate the control flow by which the pose map extracted with OpenPose is injected as conditioning information into the Stable Diffusion-based Generate Skin inpainting step.
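To make the data flow in Figure 2 concrete, the following is a minimal sketch of the preprocessing stage, assuming the Hugging Face diffusers library for the OpenPose-ControlNet-conditioned Stable Diffusion inpainting. The SCHP parsing, mask construction, pose extraction, and Leffa synthesis steps are represented by hypothetical helpers (run_schp_parsing, build_skin_mask, run_openpose, run_leffa), and the checkpoint names are illustrative rather than the authors' exact configuration.

```python
# Minimal sketch of the CaP-VTON preprocessing flow described in Figure 2.
# Assumes the Hugging Face diffusers library; run_schp_parsing(), build_skin_mask(),
# run_openpose(), and run_leffa() are hypothetical helpers standing in for the
# SCHP, OpenPose, and Leffa components, which are separate projects.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # any SD 1.5 inpainting checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def cap_vton(person_img, cloth_img):
    # (A) -> (B): parse the person, mask the regions that must become skin,
    # and inpaint them under OpenPose pose conditioning (green/orange arrows).
    parsing = run_schp_parsing(person_img)            # SCHP label map
    skin_mask = build_skin_mask(parsing, cloth_img)   # arms/shoulders to restore
    pose_map = run_openpose(person_img)               # ControlNet condition
    pre_img = pipe(
        prompt="bare arms and shoulders, natural skin, photorealistic",
        image=person_img,
        mask_image=skin_mask,
        control_image=pose_map,
    ).images[0]
    # (B) -> (C): hand the clothing-agnostic person image to Leffa (blue arrows).
    return run_leffa(pre_img, cloth_img)
```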
Figure 3. Changes in Leffa output results according to clothing type.
Figure 4. Preprocessed images for Stable Diffusion Generate Skin inpainting.
Figure 5. Comparison of results with and without the OpenPose ControlNet.
Figure 6. Examples of results of the Generate Skin pipeline.
Figure 7. Modified ensemble of the VITON-HD and DressCode models.
Figure 8. Comparison of the two masking methods; red-highlighted regions indicate the areas to be masked.
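As an illustration of the masking comparison in Figure 8, the sketch below derives a clothing-agnostic mask from an SCHP parsing map. The label indices follow the ATR-style convention commonly used with SCHP and are assumptions rather than the paper's exact mapping.

```python
# Illustrative sketch of deriving a clothing-agnostic mask from an SCHP parsing
# map, as contrasted in Figure 8. The label ids follow the ATR-style convention
# often used with SCHP and are assumptions, not the paper's exact mapping; the
# real pipeline may also dilate the mask or merge further categories.
import numpy as np

UPPER_CLOTHES, DRESS, LEFT_ARM, RIGHT_ARM = 4, 7, 14, 15  # assumed label ids

def build_agnostic_mask(parsing: np.ndarray, include_arms: bool) -> np.ndarray:
    """Return a binary mask (1 = region to be inpainted/replaced)."""
    mask = np.isin(parsing, [UPPER_CLOTHES, DRESS])
    if include_arms:
        # Masking the arms as well lets the skin inpainting step regenerate
        # them when a long-sleeved garment is replaced by a short-sleeved one.
        mask |= np.isin(parsing, [LEFT_ARM, RIGHT_ARM])
    return mask.astype(np.uint8)
```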
Figure 9. Results of our method (last column): the first column shows the input person image, the second column shows the reference clothing to be applied, and the third to seventh columns show the synthesized results of the other methods.
Figure 10. Ablation study consisting of (a) Leffa only, (b) Leffa + DressCode, (c) Leffa + Generate Skin, and (d) CaP-VTON (Leffa + DressCode + Generate Skin). For each configuration, the same input image and reference clothing were applied to compare the short-sleeved silhouette and the degree of arm/shoulder skin exposure in the output image.
Figure 11. Comparison of short-sleeve and sleeveless clothing outputs.
Figure 12. Comparison results of short-sleeved silhouette consistency (CLIP-based) by model.
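The CLIP-based consistency measure summarized in Figure 12 can be approximated with a zero-shot prompt comparison. The sketch below is one possible protocol, assuming the Hugging Face transformers CLIP checkpoint; the prompts and decision rule are illustrative and may differ from the authors' exact evaluation setup.

```python
# Hedged sketch of a CLIP-based sleeve-consistency check in the spirit of
# Figure 12. The prompts, checkpoint, and decision rule are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = [
    "a person wearing a short-sleeved top with bare forearms",
    "a person wearing a long-sleeved top covering the arms",
]

def is_short_sleeved(image_path: str) -> bool:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return bool(probs[0] > probs[1])  # True if the short-sleeve prompt wins

# Short-sleeved synthesis accuracy can then be measured as the fraction of
# outputs judged short-sleeved when the reference clothing is short-sleeved.
```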
Table 1. Numerical comparison between the two models. Paired: the clothing in the input person image and the reference clothing image are the same, so the generated result can be compared directly with the ground-truth image; accordingly, we report the quantitative image-quality metrics SSIM and LPIPS in addition to FID. Unpaired: the clothing in the input person image and the reference clothing image differ, so no ground-truth image exists and SSIM or LPIPS cannot be used; we therefore report only the distribution-based FID, which evaluates overall generation quality. Metrics marked with a down arrow are better when lower, and metrics marked with an up arrow are better when higher.
Model        Paired FID ↓    Paired SSIM ↑    Paired LPIPS ↓    Unpaired FID ↓
VITON-HD     4.54            0.899            0.048             8.52
DressCode    5.05            0.949            0.021             10.73
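For reference, the paired (SSIM, LPIPS, FID) and unpaired (FID only) scores of the kind reported in Tables 1–3 can be computed along the following lines, assuming the torchmetrics package (with its image extras installed); preprocessing details such as resolution and normalization are not specified here and may differ from the authors' evaluation code.

```python
# Illustrative computation of the paired/unpaired metrics used in the tables,
# assuming torchmetrics (install with: pip install "torchmetrics[image]").
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")
fid = FrechetInceptionDistance(feature=2048)

def paired_scores(generated: torch.Tensor, target: torch.Tensor):
    """generated/target: float tensors in [0, 1], shape (N, 3, H, W)."""
    s = ssim(generated, target)
    l = lpips(generated * 2 - 1, target * 2 - 1)  # LPIPS expects [-1, 1]
    return s.item(), l.item()

def unpaired_fid(generated: torch.Tensor, real: torch.Tensor):
    """FID compares feature distributions, so no pixel-wise ground truth is needed."""
    fid.update((real * 255).to(torch.uint8), real=True)
    fid.update((generated * 255).to(torch.uint8), real=False)
    return fid.compute().item()
```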
Table 2. Numerical comparison with other models. The lower the FID and LPIPS and the higher the SSIM, the better the quality.
Model             Paired FID ↓    Paired SSIM ↑    Paired LPIPS ↓    Unpaired FID ↓
CatVTON           5.42            0.870            0.057             9.02
IDM-VTON          5.76            0.850            0.063             9.84
OOTDiffusion      9.30            0.819            0.088             12.41
Leffa             4.54            0.899            0.048             8.52
CaP-VTON (Ours)   6.55            0.897            0.062             11.28
Table 3. Numerical comparison with ablation studies (the corresponding results are shown in Figure 10).
Model                   Paired FID ↓    Paired SSIM ↑    Paired LPIPS ↓    Unpaired FID ↓
Leffa only              4.54            0.899            0.048             8.52
Leffa + DressCode       5.05            0.949            0.021             10.73
Leffa + Generate Skin   8.77            0.7284           0.089             17.14
CaP-VTON (Ours)         6.55            0.8966           0.0617            11.28
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
