LiteMP-VTON: A Knowledge-Distilled Diffusion Model for Realistic and Efficient Virtual Try-On
Abstract
1. Introduction
2. Related Works
2.1. Image-Based Virtual Try-On
2.2. Knowledge Distillation
- Logit-based Distillation: Logit-based methods [23,25,26] distill information from the output layer, using the softmax outputs of the teacher model as soft targets for training the student. Temperature scaling [25] exposes inter-class “dark knowledge”, enabling the student to learn finer decision boundaries without extra structural alignment. The approach is simple and effective, improves the performance of smaller networks, and applies to a wide range of distillation tasks (a minimal loss sketch follows this list).
- Feature-based Distillation: Feature-based methods replicate the feature representations of the teacher’s intermediate layers. FitNets [28] trains the student with additional hints from a teacher’s mid-level layer; the later Factor Transfer [27] re-encodes teacher features into compact “factors” that the student learns to match through a translator network, making feature alignment more robust to architecture mismatch (see the hint-loss sketch below).
- Relation-based Distillation: Relation-based KD exploits pairwise or higher-order relations among samples and layers. For instance, RKD [29] encodes pairwise distance and angle information, encouraging the student to reconstruct the teacher’s metric space rather than its exact activations, which proves effective for heterogeneous backbones (see the relational-loss sketch below).
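As a concrete illustration of the logit-based recipe, the following PyTorch sketch combines a temperature-scaled KL term with the standard cross-entropy loss, in the spirit of [25]. The hyperparameter values `T` and `alpha` are illustrative defaults, not values from this paper.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft KL term against the teacher plus hard-label cross-entropy."""
    # Soften both distributions with temperature T to expose "dark knowledge".
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between softened distributions; T^2 restores gradient scale.
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```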
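The feature-based idea can be sketched similarly. Below is a minimal FitNets-style hint loss [28], assuming convolutional feature maps; the 1×1-convolution regressor bridges mismatched channel widths, and the channel sizes shown are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: map the student's intermediate feature map to
    the teacher's channel width, then pull the two representations together."""

    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        # 1x1 conv regressor aligns channel dimensions across architectures.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # The teacher feature is a fixed target; no gradient flows to the teacher.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```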
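Finally, a sketch of the distance term of relational KD [29] (the angle term is omitted for brevity); the embedding shapes and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student_emb, teacher_emb, eps=1e-12):
    """Match the normalized pairwise-distance structure of teacher and
    student embeddings; inputs are (batch, dim) tensors."""
    def pairwise_dist(x):
        d = torch.cdist(x, x, p=2)                  # (B, B) Euclidean distances
        mean_d = d[d > 0].mean().clamp(min=eps)     # mean off-diagonal distance
        return d / mean_d                           # scale-invariant relation

    with torch.no_grad():
        t_rel = pairwise_dist(teacher_emb)          # teacher relations are targets
    s_rel = pairwise_dist(student_emb)
    # Huber loss is less sensitive to outlier pairs than plain L2.
    return F.smooth_l1_loss(s_rel, t_rel)
```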
3. Methods
3.1. Primary Knowledge
3.2. Framework
3.3. MP-VTON
3.4. LiteMP-VTON
3.4.1. Teacher Model
3.4.2. Student Model
3.4.3. Cross-Attention Alignment Distillation Module
3.4.4. Loss Function
4. Experiments
4.1. Experimental Setup
4.2. Quantitative Evaluation
4.3. Qualitative Evaluation
4.4. Ablation Experiments
4.5. Comparative Experiments on the Distillation Process
4.6. Inference Efficiency Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456.
- Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
- Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598.
- Li, X.; Kampffmeyer, M.; Dong, X.; Xie, Z.; Zhu, F.; Dong, H.; Liang, X. WarpDiffusion: Efficient diffusion model for high-fidelity virtual try-on. arXiv 2023, arXiv:2312.03667.
- Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A tale of two UNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4606–4615.
- Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. LaDI-VTON: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8580–8589.
- Gou, J.; Sun, S.; Zhang, J.; Si, J.; Qian, C.; Zhang, L. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7599–7607.
- Li, Y.; Wang, H.; Jin, Q.; Hu, J.; Chemerys, P.; Fu, Y.; Wang, Y.; Tulyakov, S.; Ren, J. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. Adv. Neural Inf. Process. Syst. 2024, 36, 20662–20678.
- Ondrúška, P.; Kohli, P.; Izadi, S. MobileFusion: Real-time volumetric surface reconstruction and dense tracking on mobile phones. IEEE Trans. Vis. Comput. Graph. 2015, 21, 1251–1258.
- Keller, W.; Borkowski, A. Thin plate spline interpolation. J. Geod. 2019, 93, 1251–1269.
- Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; Yang, M. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 589–604.
- Xie, Z.; Huang, Z.; Dong, X.; Zhao, F.; Dong, H.; Zhang, X.; Zhu, F.; Liang, X. GP-VTON: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23550–23559.
- Wan, Y.; Ding, N.; Yao, L. FA-VTON: A feature alignment-based model for virtual try-on. Appl. Sci. 2024, 14, 5255.
- Chen, C.; Ni, J.; Zhang, P. Virtual try-on systems in fashion consumption: A systematic review. Appl. Sci. 2024, 14, 11839.
- Xie, Z.; Huang, Z.; Zhao, F.; Dong, H.; Kampffmeyer, M.; Dong, X.; Zhu, F.; Liang, X. PASTA-GAN++: A versatile framework for high-resolution unpaired virtual try-on. arXiv 2022, arXiv:2207.13475.
- Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; Luo, P. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8485–8493.
- Choi, S.; Park, S.; Lee, M.; Choo, J. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14131–14140.
- Lee, S.; Gu, G.; Park, S.; Choi, S.; Choo, J. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 204–219.
- Zhang, S.; Ni, M.; Chen, S.; Wang, L.; Ding, W.; Liu, Y. A two-stage personalized virtual try-on framework with shape control and texture guidance. IEEE Trans. Multimed. 2024, 26, 10225–10236.
- Yang, Z.; Zeng, A.; Yuan, C.; Li, Y. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4210–4220.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Joyce, J.M. Kullback–Leibler divergence. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011.
- Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32, 4694–4703.
- Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962.
- Kim, J.; Park, S.; Kwak, N. Paraphrasing complex network: Network compression via factor transfer. Adv. Neural Inf. Process. Syst. 2018, 31, 2765–2774.
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
- Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976.
- Salimans, T.; Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv 2022, arXiv:2202.00512.
- Sun, W.; Chen, D.; Wang, C.; Ye, D.; Feng, Y.; Chen, C. Accelerating diffusion sampling with classifier-based feature distillation. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 810–815.
- Wang, C.; Guo, Z.; Duan, Y.; Li, H.; Chen, N.; Tang, X.; Hu, Y. Target-driven distillation: Consistency distillation with target timestep selection and decoupled guidance. arXiv 2024, arXiv:2409.01347.
- Yang, D.; Liu, S.; Yu, J.; Wang, H.; Weng, C.; Zou, Y. NoreSpeech: Knowledge distillation based conditional diffusion model for noise-robust expressive TTS. arXiv 2022, arXiv:2211.02448.
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Liao, W.; Jiang, Y.; Liu, R.; Feng, Y.; Zhang, Y.; Hou, J.; Wang, J. Stable diffusion-driven conditional image augmentation for transformer fault detection. Information 2025, 16, 197.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3836–3847.
- Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721.
- Güler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7297–7306.
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8748–8763.
- Kim, G.; Kwon, T.; Ye, J.C. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435.
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
- Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. arXiv 2018, arXiv:1801.01401.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
Quantitative comparison of virtual try-on models (Section 4.2):

| Model | LPIPS ↓ | SSIM ↑ | FID ↓ | KID ↓ | Parameters |
|---|---|---|---|---|---|
| VITON-HD | 0.116 | 0.862 | 12.12 | 0.32 | 100 M |
| HR-VITON | 0.104 | 0.878 | 11.27 | 0.27 | 100 M |
| DCI-VTON | 0.081 | 0.880 | 8.76 | 0.11 | 1027 M |
| MP-VTON (ours) | 0.078 | 0.887 | 8.73 | 0.11 | 1027 M |
| Teacher Model (w/o CLIP) | 0.081 | 0.879 | 8.76 | 0.12 | 898 M |
| Student Model (LiteMP-VTON) | 0.090 | 0.870 | 9.78 | 0.17 | 286 M |
Ablation results for the teacher and student models (Section 4.4):

| Methods | Teacher LPIPS ↓ | Teacher SSIM ↑ | Teacher FID ↓ | Teacher KID ↓ | Student LPIPS ↓ | Student SSIM ↑ | Student FID ↓ | Student KID ↓ |
|---|---|---|---|---|---|---|---|---|
| A | 0.075 | 0.853 | 11.04 | 0.25 | 0.183 | 0.764 | 15.6 | 0.51 |
| A + B | 0.090 | 0.873 | 9.12 | 0.21 | 0.121 | 0.857 | 11.7 | 0.31 |
| A + B + C | 0.081 | 0.879 | 8.76 | 0.12 | 0.090 | 0.870 | 9.8 | 0.17 |
| A + B + C + D | 0.078 | 0.887 | 8.73 | 0.11 | - | - | - | - |
Comparative experiments on the distillation process (Section 4.5):

| Case | LPIPS ↓ | SSIM ↑ | FID ↓ | KID ↓ |
|---|---|---|---|---|
| Case 1 | 0.135 | 0.576 | 23.5 | 0.54 |
| Case 2 | 0.102 | 0.683 | 16.6 | 0.42 |
| Case 3 | 0.095 | 0.721 | 13.2 | 0.31 |
| Case 4 | 0.090 | 0.870 | 9.8 | 0.17 |
Inference efficiency of the teacher and student models (Section 4.6):

| Model | GFLOPs | Inference Time (s) | Memory Usage (MB) |
|---|---|---|---|
| Teacher Model | 548.34 | 2.39 | 6713 |
| Student Model | 238.24 | 1.08 | 2211 |