Systematic Review

Computer Vision for Fashion: A Systematic Review of Design Generation, Simulation, and Personalized Recommendations

LAPSSII—Laboratory of Processes, Signals, Industrial Systems, and Computer Science, Higher School of Technology, Cadi Ayyad University, B.P. 89, Safi 30000, Morocco
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 11; https://doi.org/10.3390/info17010011
Submission received: 10 November 2025 / Revised: 6 December 2025 / Accepted: 11 December 2025 / Published: 23 December 2025

Abstract

The convergence of fashion and technology has created new opportunities for creativity, convenience, and sustainability through the integration of computer vision and artificial intelligence. This systematic review, following PRISMA guidelines, examines 200 studies published between 2017 and 2025 to analyze computational techniques for garment design, accessories, cosmetics, and outfit coordination across three key areas: generative design approaches, virtual simulation methods, and personalized recommendation systems. We comprehensively evaluate deep learning architectures, datasets, and performance metrics employed for fashion item synthesis, virtual try-on, cloth simulation, and outfit recommendation. Key findings reveal significant advances in Generative adversarial network (GAN)-based and diffusion-based fashion generation, physics-based simulations achieving real-time performance on mobile and virtual reality (VR) devices, and context-aware recommendation systems integrating multimodal data sources. However, persistent challenges remain, including data scarcity, computational constraints, privacy concerns, and algorithmic bias. We propose actionable directions for responsible AI development in fashion and textile applications, emphasizing the need for inclusive datasets, transparent algorithms, and sustainable computational practices. This review provides researchers and industry practitioners with a comprehensive synthesis of current capabilities, limitations, and future opportunities at the intersection of computer vision and fashion design.

Graphical Abstract

1. Introduction

The fashion industry has undergone significant transformation in recent years due to technological advancements and shifting social preferences [1]. The emerging field of computer vision is profoundly impacting the fashion industry through innovative generative, simulative, and recommender techniques [1]. Generative models leverage vast datasets to enable virtual synthesis, creating novel designs without physical constraints [2]. This synthesis facilitates unprecedented virtual try-on experiences using only images as inputs [2,3]. Simulative systems apply computational principles from physics to digitally simulate how virtual garments behave on 3D body models. By realistically modeling fabric drape and movement, these simulations bring virtual designs to life [4,5]. They also enable immersive virtual and augmented reality try-on experiences. Recommender systems harness pattern recognition of past behaviors to generate hyper-personalized recommendations [6,7]. By understanding consumer preferences from purchases, browsing history, and social media activity, these systems can suggest customized outfits tailored to each individual. These generative, simulative, and recommender techniques are revolutionizing fashion industry processes, from design and prototyping to virtual sample rooms and targeted marketing. They offer a vision of a more sustainable, accessible, and customized future for the industry through the synergy of creativity and advanced computer vision technologies [8,9].
Multiple surveys [1,8,10,11,12,13] and reviews have addressed the complex relationship between fashion and artificial intelligence (AI), with each study categorizing fashion methodologies in ways that vary considerably across the literature. This diversity in classification approaches highlights the multifaceted nature of AI’s impact on the fashion industry, showcasing an evolving landscape where technology revolutionizes not only fashion creativity but also marketing, production, and consumption. However, existing surveys focus on isolated subdomains without examining cross-domain synergies or providing systematic reproducibility resources, motivating this unified review spanning generative, simulative, and recommender systems with comprehensive implementation guidance. In 2021, Cheng et al. [1] presented an extensive examination of the primary applications of computer vision methodologies within the fashion domain, classifying these applications into four key dimensions: fashion detection, analysis, synthesis, and recommendation. For each dimension, the authors examined the latest methods, benchmark datasets, and common metrics employed in previous studies. Mohammadi et al. [10] presented a review of artificial intelligence applications in the fashion and apparel industry, collectively referred to as "smart fashion". The authors discussed how AI is becoming integrated across the industry for applications such as feature extraction, classification, attribute recognition, and recommendation systems. The paper also provided an overview of different modeling techniques used in image-based try-on systems, including 2D modeling, 3D modeling, and image-based rendering. Gong et al. [11] discussed the use of deep learning techniques for fashion recommendation systems, emphasizing how leveraging aesthetics and personalization is crucial for improving accuracy. Their end-to-end deep learning approach on set data adopts a data-centric strategy for training a model capable of autonomously generating appropriate fashion ensembles, drawing inspiration from the growing prevalence of online fashion trends on platforms such as Pinterest and YouTube. Ma et al. [14] provided a comprehensive overview of the emerging field of deep learning-based makeup style transfer, reviewing various works published between 2016 and 2021 and analyzing the diverse generative models and techniques proposed. Popular approaches investigated include GANs [15], CNNs [16], and transformer-based methods [17]. Models such as BeautyGAN [18], PairedCycleGAN [19], and PSGAN [20], which leverage techniques like style transfer, color harmonization, and face parsing, are discussed. In 2023, Deldjoo et al. [8] offered a comprehensive overview of contemporary fashion recommender systems, discussing the challenges faced by these systems, such as fashion item representation, compatibility, personalization, fit, interpretability, and trend discovery. The utilization of various data sources and computational features, including image and text representation, social network features, and user–item attributes, is crucial for addressing these challenges. Also in 2023, Liu et al. [12] discussed the importance of understanding and predicting fashion trends, popularity, and user preferences, as well as recent advancements in makeup transfer and virtual try-on technologies.
The article highlighted various deep learning systems that utilize GANs for virtual try-on capabilities and provided comprehensive coverage of AI applications in fashion analysis, recommendation, and synthesis. Our focus extends beyond highlighting prominent works in the field; we also aim to encompass any pertinent contributions. This approach enables us to shed light on potential innovations that may have been overlooked and facilitates an extensive analysis of fashion evolution over the years, encompassing a broad spectrum of research. To achieve this, we selected articles published between 2017 and 2025, allowing for occasional deviations from this timeframe. To organize this vast body of research, we categorized these articles into various application classes using a multi-label scheme. With this method, an individual article can be allocated to multiple application types if it addresses several relevant facets. Figure 1 illustrates these categories. Additionally, we categorize each article under a specific application category only if it explicitly presents significant findings related to that particular application. In summary, our work’s contributions can be encapsulated as follows:
  • We present a comprehensive examination of the latest advancements in fashion research, categorizing research topics into three main domains: generative, simulative, and recommender.
  • Within each of these intelligent fashion research categories, we offer a comprehensive and well-organized review of the most notable methods and their corresponding contributions.
  • We compile evaluation metrics tailored to diverse problems and provide performance comparisons across various methods.
  • We delineate prospective avenues for future exploration that can foster further progress and serve as a source of inspiration for the research community.
In this approach, we employ a categorization method for articles based on their applications, aligning with the taxonomy detailed in Cheng et al. [1]. These applications are classified into three primary classes: Low-Level Fashion Recognition, Mid-Level Fashion Understanding, and High-Level Fashion Applications. Importantly, overlaps can exist between these categories. For instance, higher-level applications may incorporate mid-level or a combination of multiple low-level functionalities, such as try-on applications that may also involve parsing, labeling, classification, detection, and other functions. When addressing complex topics that involve summarizing and comparing various research works, a thorough understanding of the core concepts, techniques, and frameworks in the domain is essential. One effective method is to actively compile keywords that are frequently discussed in influential papers. This set of keywords serves as a knowledge repository, enabling us to grasp new research papers more rapidly by identifying familiar concepts. Additionally, it empowers us to respond to inquiries more effectively by recognizing papers that contain pertinent information.
This systematic review is organized into the following sections. Section 2 presents our PRISMA methodology, including search strategy, inclusion criteria, and data extraction procedures. Section 3, Section 4 and Section 5 examine the three primary domains: generative fashion (synthesis and virtual try-on), simulative fashion (physics-based modeling and real-time rendering), and recommender fashion (personalized suggestions and compatibility learning), with comprehensive analysis of technological evolution, performance metrics, and state-of-the-art methods. Section 6 compiles open-source implementations, pre-trained models, and benchmark datasets to facilitate reproducibility. Section 7 analyzes synergistic integration across the three domains, demonstrating how advances in one area enable progress in others. Finally, Section 8 summarizes main findings, identifies future research directions, and discusses ethical considerations for responsible AI deployment in fashion applications.

2. Methodology Overview

The PRISMA protocol [21] was used in this systematic review, as it provides a structured and widely recognized framework for identifying, evaluating, and synthesizing research in computer vision for fashion. The review process consists of five steps: (1) commissioning the review; (2) document identification; (3) primary study selection; (4) backward and forward search execution; and (5) data extraction with progress tracking.

2.1. Research Questions

This systematic review analyzes computer vision applications in fashion through five essential research questions that examine current developments, research methods, and future prospects of the field.
RQ1: What are the current state-of-the-art generative computer vision techniques applied in fashion applications? This question evaluates generative methods, including Generative Adversarial Networks (GANs), diffusion models, and style transfer techniques for fashion design synthesis, texture generation, and creative fashion content creation.
RQ2: How are simulative computer vision methods being utilized in fashion domains, and what is their effectiveness? This question investigates virtual try-on systems, 3D garment simulation, and physics-based cloth modeling to evaluate their realistic features, computational efficiency, and practical applicability.
RQ3: What computer vision-based recommender system approaches are employed for fashion recommendation? This question investigates recommendation systems that utilize visual feature extraction, style analysis, and outfit compatibility assessment through content-based and hybrid approaches.
RQ4: What are the dominant evaluation metrics, benchmark datasets, and performance standards used across these applications? This question investigates methodological aspects by analyzing datasets, metrics, and standards for evaluating various approaches.
RQ5: What are the key challenges, limitations, and future research directions in computer vision for fashion applications? This question addresses current technical obstacles, methodological problems, and potential opportunities in this field.

2.2. Search Strategy

The first stage required literature identification through Scopus, Web of Science, IEEE Xplore, ACM Digital Library, and other databases. These databases were specifically chosen for their broad coverage of computer science, computer vision, artificial intelligence, and fashion technology publications, ensuring access to both established and emerging research in interdisciplinary areas. The initial search string used was as follows:
("computer vision" OR "deep learning" OR "machine learning" OR "neural network*" OR "CNN" OR "GAN" OR "generative adversarial network*" OR "transformer*" OR "diffusion model*") AND ("fashion" OR "clothing" OR "apparel" OR "garment*" OR "textile*" OR "style" OR "outfit*" OR "fashion design") AND ("generative" OR "simulation" OR "recommend*" OR "virtual try-on" OR "synthesis" OR "style transfer" OR "fashion generation" OR "personalization" OR "virtual fitting")
The search focused on titles, abstracts, and keywords across the databases for the period between 2017 and 2025. This timeframe enables examination of the rapid development of deep learning applications in fashion while capturing recent advances in computer vision technology. To ensure consistency across databases, we adapted the search syntax according to each platform’s specific requirements while maintaining the semantic equivalence of the query terms.
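As an illustration of how the query can be adapted programmatically for each platform, the following minimal Python sketch composes the three term groups into a single boolean string; the term lists mirror the query above, while any platform-specific field tags (e.g., title/abstract restrictions) are assumptions that would need to be added per database.

```python
# Illustrative sketch: compose the boolean search string from the three term groups.
# Platform-specific field tags and syntax (e.g., title/abstract restrictions) are
# assumptions and would need to be added for each database separately.

METHOD_TERMS = ['"computer vision"', '"deep learning"', '"machine learning"',
                '"neural network*"', '"CNN"', '"GAN"', '"generative adversarial network*"',
                '"transformer*"', '"diffusion model*"']
DOMAIN_TERMS = ['"fashion"', '"clothing"', '"apparel"', '"garment*"', '"textile*"',
                '"style"', '"outfit*"', '"fashion design"']
TASK_TERMS = ['"generative"', '"simulation"', '"recommend*"', '"virtual try-on"',
              '"synthesis"', '"style transfer"', '"fashion generation"',
              '"personalization"', '"virtual fitting"']

def build_query(groups):
    """Join each term group with OR, then combine the groups with AND."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)

if __name__ == "__main__":
    print(build_query([METHOD_TERMS, DOMAIN_TERMS, TASK_TERMS]))
```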

2.3. Study Selection

The study selection process is illustrated in the PRISMA flow diagram (Figure 2). The initial search yielded 6600 articles from six databases. The first screening phase produced 3435 articles after removing duplicate and ineligible records. The title and abstract review process resulted in the exclusion of 3170 articles from the 3435 screened articles.
Assessment of the remaining 230 full-text articles led to the exclusion of 30 articles that did not meet the inclusion criteria, resulting in 200 studies for the final analysis. The selection process included only high-quality studies that focused on computer vision applications in fashion. Our systematic review demonstrates robust representation across high-impact academic venues based on comprehensive bibliographic analysis of 200 publications. The publications follow this distribution pattern: IEEE (29.1%), ACM (15.5%), and Elsevier/Scopus (14.6%), demonstrating that the engineering, technology, and computing fields have made the most significant research contributions. Additionally, the Web of Science database contains 33.0% of the peer-reviewed literature, while specialized venues and other sources comprise 7.8% of the collection.

2.4. Inclusion and Exclusion Criteria

Studies were included if they met the following criteria: (1) focused on computer vision applications in fashion domains, (2) addressed generative methods, simulation techniques, or recommendation systems, (3) were published in peer-reviewed venues between 2017 and 2025, (4) were written in English, and (5) presented empirical results or novel methodologies. Studies were excluded if they: (1) lacked technical depth or empirical evaluation, (2) focused solely on traditional non-deep learning computer vision methods, or (3) were not accessible in full text.

2.5. Data Extraction

Data were systematically extracted from each included study using a standardized data extraction form. Two reviewers (Authors 1 and 2) independently extracted the following information:
  • Publication details (authors, year, venue, publisher)
  • Application domain (generative, simulative, or recommender systems)
  • Computer vision techniques and architectures employed
  • Datasets used for training and evaluation
  • Evaluation metrics and performance results
  • Key findings and contributions
  • Reported limitations and future research directions
Discrepancies between reviewers were resolved through discussion. When information was unclear or missing, the original publications were consulted, and if necessary, Supplementary Materials were reviewed.
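For concreteness, the extraction form can be represented as a structured record. The Python sketch below mirrors the items listed above; the field names and example values are our own illustration, not the exact form used in the review.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical extraction record mirroring the fields listed above;
# field names and example values are illustrative only.
@dataclass
class ExtractionRecord:
    authors: List[str]
    year: int
    venue: str
    publisher: str
    domain: str                                   # "generative", "simulative", or "recommender"
    techniques: List[str]                         # e.g., ["GAN", "diffusion"]
    datasets: List[str]                           # e.g., ["DeepFashion"]
    metrics: dict = field(default_factory=dict)   # e.g., {"FID": 12.9}
    key_findings: str = ""
    limitations: str = ""

record = ExtractionRecord(
    authors=["Doe, J."], year=2023, venue="Example Conf.", publisher="IEEE",
    domain="generative", techniques=["GAN"], datasets=["DeepFashion"],
    metrics={"FID": 18.2}, key_findings="Improved texture fidelity.",
    limitations="Limited to frontal poses.",
)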

2.6. Quality Assessment

Each included study was assessed for methodological quality using criteria adapted for computer vision research:
  • Clarity of research objectives: Clear problem definition and research goals
  • Methodological rigor: Appropriate choice of computer vision techniques and experimental design
  • Evaluation validity: Use of standard datasets, appropriate metrics, and comparison with baselines
  • Reproducibility: Sufficient implementation details and availability of code or models
  • Significance of contribution: Novel approaches or substantial improvements over existing methods
Studies that did not meet minimum quality thresholds across these criteria were excluded during the full-text assessment phase. Quality assessment was performed independently by two reviewers, with disagreements resolved through consensus.

2.7. Data Synthesis

Due to the heterogeneity of methods, datasets, and evaluation metrics across the included studies, a qualitative narrative synthesis approach was adopted. Studies were grouped according to their primary application domain (generative, simulative, or recommendation systems) and analyzed thematically. For each domain, we synthesized the computer vision techniques employed, performance characteristics, and common challenges. Quantitative meta-analysis was not feasible due to the diverse nature of the applications and evaluation approaches.

3. Generative Fashion

Generative fashion through artificial intelligence produces digital textile and garment designs by leveraging computational creativity with generative adversarial networks (GANs) [15], variational autoencoders (VAEs) [22], and 3D parametric modeling [23]. This technology encompasses two main applications: textile pattern synthesis and complete outfit generation. Textile pattern synthesis employs GANs and style transfer methods to generate novel textures, colors, and prints that expand creative possibilities for fabric design and surface decoration. Through outfit generation (Figure 3) [24], these systems produce coordinated ensembles by analyzing garment compatibility and evaluating design elements such as silhouette, drape, and textile properties.
Beyond design generation, these AI-driven systems provide personalized styling recommendations tailored to individual preferences, body types, and specific occasions, demonstrating the transformative potential of artificial intelligence in the fashion and textile industry. The creation of complete outfits requires deep learning models to generate harmonious multi-garment ensembles that account for textile characteristics, garment construction, and human body morphology. These advanced systems capture semantic relationships between clothing items, analyze contemporary fashion trends and personal style preferences, and adhere to principles of garment coordination, textile compatibility, and fit optimization. The following section provides a comprehensive evaluation of deep learning methods for generative fashion developed through 2025. We discuss more than a dozen papers and their proposed methods. The comparison focuses on quantitative aspects such as model architecture, performance on benchmark datasets, types of synthesis supported, and other relevant factors. Comparisons of datasets, metrics, and results across methods are presented in tables and figures.

3.1. Fashion Synthesis

The fashion synthesis process relies on machine learning and computer vision techniques to create and style fashion designs through automated algorithms. Through advanced generative models such as GANs, fashion synthesis techniques [15,25,26] are capable of digitally creating photorealistic images of garments, accessories, and complete outfits conditioned on different attributes. Designers can either generate new design concepts or modify existing styles through these methods. Several approaches have been proposed to address various fashion synthesis tasks, including generating novel garment designs, transferring styles between items, creating coordinated textile patterns, and generating matched outfits.

3.1.1. Clothing Synthesis

Methods based on Generative Adversarial Networks (GANs) have found extensive application in tasks related to garment image synthesis. GANs [15] have gained significant popularity in diverse image-generation tasks owing to their capability to produce highly realistic images. Not surprisingly, several early works on fashion synthesis also employed GANs for this task. The training objective can be formulated as a minimax game where the generator G learns to produce synthetic images that fool the discriminator D, while the discriminator learns to distinguish real images from generated ones. The generator minimizes the discriminator’s ability to classify correctly, while the discriminator maximizes its classification accuracy.
This adversarial process is expressed mathematically as:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
where $x$ represents real fashion images from the data distribution $p_{data}$, $z$ is a random noise vector sampled from the prior distribution $p_z$, $G(z)$ generates synthetic images, and $D(x)$ outputs the probability that $x$ is real. For conditional fashion synthesis, the model is extended to the Conditional GAN (cGAN), where generation is guided by additional control signals. Both generator and discriminator are conditioned on auxiliary information $y$ such as garment category (dress, shirt, pants), target pose (standing, sitting), or desired texture (denim, silk). This conditioning allows precise control over generated garment attributes. The conditional formulation modifies the objective to:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y) \mid y)\big)\big]$$
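A minimal PyTorch sketch of one adversarial training step under this conditional objective is shown below; the generator `G`, the discriminator `D` (assumed to output a sigmoid probability), and the one-hot category encoding are illustrative placeholders rather than any specific published architecture.

```python
import torch
import torch.nn.functional as F

# Sketch of one cGAN training step for conditional fashion synthesis.
# G(z, y) and D(x, y) stand for any conditional generator/discriminator pair;
# the architectures and the 10-category one-hot encoding are assumptions.
def cgan_step(G, D, opt_G, opt_D, real_images, category_labels, z_dim=128):
    batch = real_images.size(0)
    y = F.one_hot(category_labels, num_classes=10).float()   # garment-category condition
    z = torch.randn(batch, z_dim)

    # Discriminator update: maximize log D(x|y) + log(1 - D(G(z|y)|y)).
    fake_images = G(z, y).detach()
    d_real, d_fake = D(real_images, y), D(fake_images, y)
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: fool the discriminator (non-saturating form of the objective).
    d_fake = D(G(z, y), y)
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    return loss_D.item(), loss_G.item()
```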
The Conditional Analogy GAN (CA-GAN) emerged as one of the first GAN-based models for fashion synthesis when Jetchev et al. [27] presented it in 2017. CA-GAN utilized paired fashion images to learn analogical relationships and generated corresponding garment swaps in images of people. The model produced encouraging qualitative results when exchanging fashion items while maintaining the person’s overall visual appearance.
In 2018, Xian et al. [28] proposed incorporating textile texture features as an additional input to GANs for more fine-grained control over fashion generation. Their model, TextureGAN, allowed users to control deep image synthesis with texture patches and demonstrated controllable synthesis based on sketch, color, and texture guidance. The main limitation was that users needed to specify the exact placement of texture patches during the generation process. Abdellaoui and Kachbal [29] introduced a novel approach to image matting for fashion e-commerce, utilizing deep learning-based models to extract foreground and background colors and alpha matte from images. The method addresses the challenges of extracting high-quality objects from images through the combination of supervised and unsupervised networks in fashion e-commerce applications. The authors developed deep residual networks for high-resolution background matting [30,31], which effectively separate garment regions from images. The advancement of fashion synthesis research depends heavily on large-scale fashion datasets, which are presented in Table 1. The initial datasets DeepFashion [32] and Fashion-MNIST [33] contained hundreds of thousands of garment images with categorical annotations for retrieval and classification tasks. The datasets CLOTH3D [34], Fashion-Gen [35], and VITON-HD [3] (2018–2021) introduced 3D pose, text descriptions, and high-resolution virtual try-on capabilities. Fashionpedia [36] included detailed attribute annotations and segmentation masks, while ModaNet [37] contained street fashion images with polygon annotations for precise fashion parsing. As noted above, the Conditional Analogy GAN of Jetchev et al. [27] enabled garment swapping in images of people, and texture-guided methods with spatial constraints produced more realistic garment transfers than existing approaches, as validated through user studies. One of the earliest GAN-based methods was DisPG [38], which focused specifically on generating new poses and physiques of garments on people through disentangled person image generation. VITON [2] achieved better disentanglement of factors related to pose, appearance, and shape using a coarse-to-fine strategy for virtual try-on. Later works such as CP-VTON+ [39] and ACGPN [40] generated even more diverse and realistic outputs while better disentangling attributes like color and style. While GANs produced impressive results, they also suffered from well-known issues such as model instability during training and a lack of control over specific attributes. Figure 4 provides a quantitative comparison of key GAN-based methods. As shown in Figure 4a, the field has witnessed substantial progress, with FID scores improving by 54% (from 27.84 to 12.88) while Inception Scores increased by 31% (from 3.49 to 4.60), demonstrating consistent quality enhancement across multiple evaluation metrics. The inverse relationship between these metrics validates the dual advancement in both fidelity and diversity of generated fashion images. Parameter efficiency analysis (Figure 4b) reveals that model performance does not scale linearly with parameter count; CP-VTON achieves competitive results (FID = 23.55) with only 30 M parameters, while methods exceeding 50 M parameters show diminishing returns, suggesting that architectural design is as crucial as model capacity.
Resolution analysis (Figure 4c) demonstrates that increasing from low (256 × 192) to ultra-high (1024 × 1024) resolution yields 36% FID improvement, though this comes at significantly higher computational costs. The multi-dimensional comparison (Figure 4d) highlights the inherent trade-offs: StyleGAN-Human excels in quality and realism but sacrifices efficiency, whereas CP-VTON maintains high efficiency at moderate quality levels, indicating that method selection should be application-dependent based on whether priority is given to visual fidelity or computational efficiency.
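Since FID is the primary quality metric in these comparisons, the following sketch shows how it is computed from feature statistics; the Inception feature extraction step is assumed to have been done beforehand, and the toy example uses low-dimensional random features purely for illustration.

```python
import numpy as np
from scipy import linalg

# Sketch of the Frechet Inception Distance given two sets of network activations
# (in practice, 2048-dimensional Inception pool3 features of real vs. generated images).
def frechet_inception_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    covmean = linalg.sqrtm(cov_r @ cov_f)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # discard tiny imaginary residue from sqrtm
        covmean = covmean.real

    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

# Toy example with 64-dimensional random features standing in for Inception activations.
fid = frechet_inception_distance(np.random.randn(500, 64), np.random.randn(500, 64))
print(fid)
```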
To address these limitations, later works explored conditional GANs and additional losses and constraints. To better control the generation process and attribute disentanglement, several works incorporated conditional information into the GAN framework. As shown in Figure 4, VITON-HD [3] and CP-VTON [43] incorporated categorical conditional information to constrain generation, producing more fine-grained outputs that preserved the original metadata. ACGPN [40] and VITON [2] focused explicitly on pose manipulation by conditioning on target pose maps, which allowed for refining and editing poses in a more precise manner. Overall, conditional GAN approaches improved attribute disentanglement and controllability compared to earlier unconditional GAN methods for fashion generation. Incorporating conditional information enhances the interpretability of generated fashion items and provides a more structured framework for users or designers to guide the synthesis process. In 2019, Dong et al. [44] developed FE-GAN, which introduced adversarial parsing learning, enabling more precise control over diverse fashion image manipulations. FE-GAN demonstrated improved attribute consistency compared to prior work through free-form sketches and color strokes. Other contributions focused on enhancing realism through refinement networks [45], modeling textile texture patterns [46], and capturing garment dynamics. Progress accelerated with significant advances each year in quality and control capabilities. More recent work has aimed to address key lingering limitations, such as lack of diversity, inability to smoothly manipulate generation attributes, and scaling to higher resolutions critical for practical applications. In 2021, Lewis et al. [47] developed TryOnGAN, incorporating body-aware layered interpolation to enable intuitive control between different garment combinations for generating varied, nuanced virtual try-on images. It achieved state-of-the-art performance on virtual try-on benchmarks. Patashnik et al. [48] introduced StyleCLIP, which allows text-driven manipulation of StyleGAN imagery through CLIP embeddings for controlling garment style, color, and attributes individually or jointly. It demonstrated fine-grained control over fashion attribute manipulation quantitatively. A major breakthrough arrived in 2022 when Fu et al. [49] presented StyleGAN-Human, capable of synthesizing incredibly high-resolution 1024 × 1024 pixel human images with fashion details like clothing textures and accessories clearly resolved. StyleGAN employs Adaptive Instance Normalization (AdaIN) to inject style information at each layer of the generator. AdaIN first normalizes feature activations to zero mean and unit variance, then applies learned affine transformations controlled by style vectors. This mechanism allows fine-grained control over different visual attributes at different scales—coarse layers control pose and overall shape, while fine layers control color and texture details. The AdaIN operation at each convolutional layer is defined as:
$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
where $x_i$ is the feature map of layer $i$, $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation computed spatially, and $y = (y_s, y_b)$ represents style vectors derived from a mapping network that transforms a latent code $z \in \mathcal{Z}$ to the intermediate latent space $w \in \mathcal{W}$:
$$w = f(z), \quad z \sim \mathcal{N}(0, I)$$
The generator synthesizes images progressively through multiple resolution levels, with the final output at resolution $r$ given by:
$$I_r = g_r\big(\mathrm{AdaIN}(x_{r-1}, w_r)\big)$$
where $g_r$ represents the convolutional block at resolution $r$, and $w_r$ is the style vector for that level. StyleGAN-Human utilized data-centric engineering approaches with large-scale datasets to ensure quality was maintained across resolutions, representing a significant advancement over previous human generation methods. This work established new benchmarks for unconditional human generation with diverse poses and garments. TryOnGAN [47], discussed above, likewise benefited from training with high-resolution data, learning interpretable latent representations that enable semantic manipulations and superior performance on established benchmarks. The resolution and quality metrics (Figure 4c) show significant yearly improvements from VITON’s 256 × 192 resolution to StyleGAN-Human’s [49] 1024 × 1024 generations, which were validated quantitatively. Early works operated at low resolution; however, StyleGAN-Human [49] achieved state-of-the-art quality at the cost of high-end GPU requirements. The ACGPN model [40] generates output at reasonable speeds due to its efficient multi-stage architecture that runs on standard GPU configurations. Further optimizations are clearly still needed for interactive applications.
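To make the AdaIN mechanism described above concrete, a minimal PyTorch sketch is given below; the style-vector dimensionality and the single learned affine layer are illustrative choices, not the exact StyleGAN-Human configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of Adaptive Instance Normalization as used in StyleGAN-style generators:
# features are normalized per channel over spatial dimensions, then re-scaled and shifted
# by style-dependent parameters produced from the intermediate latent vector w.
class AdaIN(nn.Module):
    def __init__(self, num_channels, w_dim=512):
        super().__init__()
        # Learned affine transform mapping w to per-channel scale (y_s) and bias (y_b).
        self.affine = nn.Linear(w_dim, 2 * num_channels)

    def forward(self, x, w):
        # x: (batch, channels, height, width); w: (batch, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)         # split into scale and bias
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)             # spatial mean per channel
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8    # spatial std per channel
        return y_s * (x - mu) / sigma + y_b

# Example: styling a 64-channel feature map with a 512-dimensional latent vector.
adain = AdaIN(num_channels=64)
out = adain(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```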
Fashion synthesis has primarily focused on modeling and generating garments and apparel. The development of these techniques creates possibilities for applying synthesis methods in adjacent fields, including beauty and cosmetics. Facial appearance strongly affects overall fashion presentation and style expression. While the field of garment synthesis has gained substantial attention, jewelry synthesis remains relatively underexplored due to additional challenges in modeling intricate textures, small-scale materials, and fine geometric details. Some recent works have made progress in this domain through specialized texture modeling approaches.

3.1.2. Jewelry Synthesis

Jewelry synthesis utilizes computer vision technology to create photorealistic images and 3D models of jewelry items, including rings, necklaces, bracelets, and gemstones. Traditional jewelry design requires skilled artisans and extensive manual labor; however, jewelry synthesis employs deep learning and generative models to create new designs algorithmically at scale, which could revolutionize the industry by accelerating design processes, enabling mass customization, and reducing production costs [50,51]. Recent developments in 3D point cloud processing [52,53] and sparse convolution networks [54] demonstrate potential for handling intricate geometric structures; however, material representation and multi-scale detail synthesis remain significant challenges. Several deep learning studies published between 2019 and 2024 have advanced the field of computational 3D object generation and synthesis. TreeGAN [55] (Figure 5a), published in 2019, was an early method that could generate novel 3D point cloud shapes using tree-structured graph convolutions for hierarchical feature learning. The following year, PC-GAN [56] focused specifically on adversarial learning for point cloud generation with improved structural coherence. In 2022, Point-E [57] introduced a transformer-based approach that combines text-to-image diffusion with image-to-point cloud generation for rapid 3D synthesis. Most recently, in 2024, NPCD [58] introduced a neural point cloud diffusion approach to generate diverse, high-quality 3D objects with disentangled shape and appearance control, represented as point clouds with high-dimensional features. As this research area further develops, increasingly advanced generation, personalization, and interactive experiences are expected to transform digital 3D design and applications. These studies demonstrate the growing applications of deep generative models for 3D object synthesis, moving beyond static 3D shapes to include dynamically synthesized textures, multi-modal conditioning, and point cloud-based geometry generation. The studies employed different datasets and evaluation strategies (Figure 5b) to assess the quality of their generated 3D objects. TreeGAN [55] trained and tested on ShapeNet, reporting a Fréchet Inception Distance (FID) score of 43.2 and an Inception Score of 2.8 for tree-structured point cloud generation. Point-E [57] leveraged text-to-image diffusion for 3D generation using ShapeNet, achieving an FID of 28.4 and an IS of 3.1. PC-GAN [56] used the ModelNet40 dataset and obtained an FID of 35.7 on 3D object classification tasks. NPCD [58] synthesized 3D objects from PhotoShape, achieving the best FID of 22.6 as well as a high Inception Score (IS) of 3.4 through its neural point cloud diffusion approach. While FID measures resemblance to real data distributions, accuracy metrics evaluate specific tasks such as classification or geometric reconstruction quality.

3.1.3. Facial Synthesis

The field evolved from simple makeup transfer to advanced diffusion-based techniques, which provided better control and more realistic results. Table 2 presents the main methods employed from 2020 to 2024. The early generative approaches of Feng et al. [59] were developed on small datasets with a focus on visual quality, while Bougourzi et al. [60] proposed CNN ensemble methods that achieved 91.2% MAE on SCUT-FBP5500 for facial beauty assessment. The field shifted to diffusion methods in 2024 when Zhang et al. [61] presented Stable-Makeup with detail-preserving encoders and cross-attention mechanisms, and Sun et al. [62] developed SHMT for self-supervised hierarchical makeup transfer using latent diffusion models without pseudo-paired training data, achieving unprecedented realism and controllability.
Current methods [63,64] achieve high-accuracy facial makeup detection using deep learning. Lebedeva et al. [65] demonstrated BeautyNet’s significant improvements through multi-scale CNN architectures with transfer learning. Recent self-supervised and weakly supervised strategies reduce the need for labeled datasets, as shown in adversarial face synthesis methods [66]. Makeup style analysis and transfer have direct applications in virtual cosmetic try-ons, as demonstrated by adversarial makeup transfer methods that protect facial privacy while maintaining visual quality [67]. Fine-grained makeup attribute detection serves as an important prior for controlled face generation tasks. Future potential includes personalized makeup recommendation systems becoming mainstream in virtual fashion and beauty applications.
The results of four adversarial attack methods appear in Table 3, which includes AdvFaces [66], GFLM [68], PGD [69], and FGSM [70] applied to five face recognition systems: FaceNet [71], SphereFace [72], ArcFace [73], and commercial models COTS-A and COTS-B. The table measures attack success rates showing prediction changes after perturbation. AdvFaces achieved success rates exceeding 60% against all models with 97.22% peak performance [66]. PGD and FGSM performed well with rates above 30%. GFLM proved computationally efficient while maintaining high success rates [68], whereas ArcFace showed greater robustness. AdvFaces achieved the highest structural similarity scores and computational efficiency. The computational time analysis reveals significant differences between methods, with AdvFaces and FGSM requiring minimal processing time (0.01 s and 0.03 s, respectively), while PGD demands substantially more resources at 11.74 s per attack. These timing disparities highlight the trade-off between attack sophistication and practical deployment constraints in real-world scenarios. The structural similarity measurements demonstrate that AdvFaces maintains visual coherence with scores of 0.95 ± 0.01, suggesting that perturbations remain largely imperceptible to human observers while still achieving high attack success rates. Commercial systems COTS-A and COTS-B exhibited varying degrees of vulnerability, with COTS-B showing improved resistance particularly against PGD attacks, indicating potential incorporation of adversarial defense mechanisms in newer commercial implementations.

3.2. Outfit Generation

The process of outfit generation (Figure 6) utilizes computational techniques to produce garment arrangements that follow fashion principles. Early methods applied basic compatibility rules through pairwise item matching [74]; however, these methods did not capture how multiple items relate to each other or how to represent visual elements effectively. Sequential modeling approaches brought progress through bidirectional LSTMs [75], which treated outfits as sequences to learn compatibility. Neural generative models experienced rapid growth, generating photorealistic outfits from different input conditions [76,77] (Figure 7). The following section evaluates current techniques by comparing datasets, models, generation modalities, attribute control capabilities, and limitations through representative approaches and visual comparison tables, tracking technological progress to identify breakthroughs and current challenges for future research directions.

3.2.1. Outfit Generative Models

Current deep learning methods enable generative models to learn intricate multi-item relationships from extensive outfit datasets. These methods learn to represent fashion items through low-dimensional embeddings that maintain compatibility relationships between them. The research of Vasileva et al. [76] combined type-aware embeddings with Polyvore outfit triplet losses through generalized distance metrics, while Han et al. [75] developed end-to-end learning of visual-semantic representations and compatibility relationships. Sequential generative models demonstrate potential through conditional processes by using bidirectional LSTMs to predict next-item distributions, treating outfits as top-to-bottom sequences. In the type-aware approach of Vasileva et al. [76], for a category $c$, the embedding function $f_c : \mathcal{I} \rightarrow \mathbb{R}^d$ maps fashion items to a $d$-dimensional space where compatible items are close. To learn such embeddings, the model employs a triplet loss that encourages the distance between an anchor item and compatible items to be smaller than the distance to incompatible items by at least a margin $\alpha$. For each training triplet consisting of an anchor garment $a$ (e.g., a blue shirt), a compatible positive item $p$ (e.g., matching jeans), and an incompatible negative item $n$ (e.g., pants with a conflicting pattern), the loss is:
$$\mathcal{L}_{triplet} = \sum_{(a,p,n)} \max\left(0,\; \|f_c(a) - f_c(p)\|_2^2 - \|f_c(a) - f_c(n)\|_2^2 + \alpha\right)$$
where $a$ is an anchor item, $p$ is a compatible positive item, $n$ is an incompatible negative item, and $\alpha$ is the margin parameter.
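A minimal PyTorch sketch of this triplet objective over embedded items is shown below; the embedding network `embed` is an assumed placeholder for the category-specific embedding $f_c$.

```python
import torch
import torch.nn.functional as F

# Sketch of the type-aware triplet loss: the anchor should lie closer to the compatible
# (positive) item than to the incompatible (negative) item by at least margin alpha.
# `embed` stands for any category-specific embedding network f_c and is assumed here.
def triplet_compatibility_loss(embed, anchor, positive, negative, alpha=0.2):
    f_a, f_p, f_n = embed(anchor), embed(positive), embed(negative)
    d_pos = (f_a - f_p).pow(2).sum(dim=1)     # squared distance to compatible item
    d_neg = (f_a - f_n).pow(2).sum(dim=1)     # squared distance to incompatible item
    return F.relu(d_pos - d_neg + alpha).mean()

# Toy usage: a linear embedding over precomputed 512-dimensional image features.
embed = torch.nn.Linear(512, 64)
loss = triplet_compatibility_loss(embed, torch.randn(8, 512),
                                  torch.randn(8, 512), torch.randn(8, 512))
```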
For sequential modeling, Han et al. [75] implemented bidirectional LSTMs to model outfit composition as a conditional probability:
$$P(O) = \prod_{t=1}^{T} P(i_t \mid i_1, \ldots, i_{t-1}, i_{t+1}, \ldots, i_T)$$
where $O = \{i_1, i_2, \ldots, i_T\}$ represents an outfit with $T$ items, and the conditional probability is computed using bidirectional LSTM hidden states:
$$P(i_t \mid \mathrm{context}) = \mathrm{softmax}\left(W[\overrightarrow{h}_t; \overleftarrow{h}_t] + b\right)$$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and backward LSTM hidden states.
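A compact PyTorch sketch of this bidirectional formulation follows; the item-embedding size, hidden size, and vocabulary size are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

# Sketch of a bidirectional LSTM over an outfit sequence: the concatenated forward and
# backward hidden states at each position are projected to scores over the item
# vocabulary, mirroring P(i_t | context) = softmax(W [h_fwd; h_bwd] + b).
class OutfitBiLSTM(nn.Module):
    def __init__(self, num_items=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_items, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_items)   # W and b

    def forward(self, item_ids):
        # item_ids: (batch, T) indices of the garments composing each outfit
        h, _ = self.lstm(self.embed(item_ids))             # (batch, T, 2*hidden_dim)
        return torch.log_softmax(self.proj(h), dim=-1)     # log-probabilities per position

# Example: score a batch of 4 outfits, each composed of 5 items.
model = OutfitBiLSTM()
log_probs = model(torch.randint(0, 10000, (4, 5)))
```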
For graph-based approaches, Cui et al. [78] represent outfits as subgraphs where nodes correspond to fashion items and edges encode compatibility relationships. Graph neural networks learn to score outfit compatibility by aggregating information across multiple hops in the fashion graph. After L layers of graph convolution that propagate compatibility signals between connected items, the final compatibility score aggregates all item embeddings through a readout function:
$$s(O) = \mathrm{MLP}\left(\mathrm{READOUT}\left(\{\, h_i^{(L)} \mid i \in O \,\}\right)\right)$$
where $h_i^{(L)}$ is the node embedding of item $i$ after $L$ graph convolution layers, and READOUT aggregates node embeddings into a graph-level representation. Nakamura et al. [79] demonstrated style extraction through autoencoders combined with bidirectional LSTMs to generate outfits while maintaining style control. Recent research on graph neural networks has focused on outfit compatibility through node-wise relationships, which enhances the modeling of complex multi-item interactions [78]. The expansion of datasets together with models creates opportunities for personalized outfit recommendations to revolutionize digital shopping interactions. Table 4 outlines key outfit-generation approaches employed by different models along with their unique characteristics. For example, earlier works adopted sequential models using bidirectional LSTMs to contextualize predictions [75].
Meanwhile, Vasileva et al. [76] proposed type-aware embeddings that jointly represented visual similarity and compatibility through category-specific projections. More recently, Cui et al. [78] introduced node-wise graph neural networks that represent outfits as subgraphs to capture complex item relationships efficiently. Graph-based models offer an alternative framework for capturing pairwise item relationships in fashion. Cui et al. pioneered the representation of outfits as subgraphs within a fashion graph, where each node represents a category and each edge represents interaction between categories, employing graph neural networks to predict compatibility scores [78]. Early works employed bidirectional LSTMs [75] and type-aware embeddings [76]. Recent developments include node-wise graph networks [78], hypergraph networks [80], matching mechanisms with graph approaches [77], and hierarchical fashion graph networks [81].
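The graph-based scoring described above can be sketched as follows in PyTorch; the mean-aggregation message-passing layer and mean readout are simplified stand-ins for the published architectures.

```python
import torch
import torch.nn as nn

# Sketch of graph-based outfit compatibility scoring: L rounds of neighbor aggregation
# over the outfit subgraph, then a mean readout and an MLP producing the score s(O).
# The mean-aggregation update is a simplified stand-in for the published GNN layers.
class OutfitGraphScorer(nn.Module):
    def __init__(self, dim=64, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_layers)])
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h, adj):
        # h: (num_items, dim) item embeddings; adj: (num_items, num_items) 0/1 compatibility edges
        for layer in self.layers:
            neigh = adj @ h / adj.sum(dim=1, keepdim=True).clamp(min=1)  # mean over neighbors
            h = torch.relu(layer(torch.cat([h, neigh], dim=1)))          # update node states
        readout = h.mean(dim=0)                                          # graph-level embedding
        return torch.sigmoid(self.mlp(readout))                          # compatibility score

# Example: a 4-item outfit with a fully connected compatibility graph.
scorer = OutfitGraphScorer()
score = scorer(torch.randn(4, 64), torch.ones(4, 4) - torch.eye(4))
```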

3.2.2. Outfit Style Transfer Models

Outfit style transfer aims to modify outfits while maintaining visual compatibility and coherence. Early work explored rule-based techniques to swap compatible items between categories. Hsiao et al. [82] introduced Fashion++, which proposes minimal adjustments to a full-body garment outfit that will have maximal impact on its fashionability, using deep image generation neural networks that learn to synthesize clothing conditioned on learned per-garment encodings. However, such approaches focused on outfit improvement rather than style transfer per se. Deep learning models now capture rich item and style representations to transfer attributes in a data-driven manner. The CMRGAN [83] model introduced GANs for garment generation based on cross-domain compatibility, training generators to produce compatible clothing while preserving compatibility relationships. However, early evaluations relied primarily on visual similarity metrics. GD-StarGAN [84] utilized multi-domain image-to-image translation for garment design, replacing StarGAN’s residual network with a U-Net architecture to improve convergence speed and stability. Leveraging multiple references, OutfitGAN [85] represents recent advances in outfit generation, using semantic alignment modules and collocation classification to generate compatible fashion items conditioned on existing items and reference masks. Attribute-GAN modeled style through semantic attributes, using a generator supervised by adversarial-trained collocation and attribute discriminators to generate garment pairs based on matching rules [86]. The evaluation process includes automatic metrics (FID, SSIM) together with human assessments, as shown in Table 5.
The FineGAN model achieved unsupervised hierarchical disentanglement for fine-grained generation through background, shape, and appearance separation to enable controlled generation [87] (Table 6). Li et al. [88] developed text-image embedding methods to generate coherent outfits for theme-based recommendations. FashionTex introduced direct attribute manipulation capabilities for semantic control of cloth types and textures through a method that eliminates the need for pairwise training [89]. The 3D Gaussian-based generation system of ClotheDreamer produces production-ready garments from text prompts through Disentangled Clothe Gaussian Splatting [90].

3.3. Challenges, Limitations, and Future Directions

The current state of generative models for fashion synthesis shows significant progress; however, multiple major challenges remain to achieve their full potential. The creation of detailed, realistic textures and complex design elements that appear in actual garments and accessories represents a significant technical barrier. The relationship between visual appearance and other modalities, such as text, requires further investigation to properly disentangle their dependencies in conditional generation. The advancement of generative fashion models raises substantial ethical concerns regarding deepfake technology and synthetic content creation, as noted by Westerlund et al. [91]. Realistic human figure garment generation through generative models presents a risk of creating deceptive or non-consensual content that users can exploit. Security protocols must be established to prevent unauthorized use of personal images and to mark synthetic content production explicitly. These systems may perpetuate unrealistic beauty standards while reinforcing existing biases in training data, thus affecting users’ body perception and self-worth [92]. The development of responsible systems depends on training datasets that represent diverse body shapes, different ethnicities, and various cultural fashion traditions. Generative fashion systems require access to sensitive user information, including measurement data and individual fashion preferences. System architectures should implement federated learning and differential privacy methods to preserve user privacy [93]. Users must maintain control over their personal data through explicit consent procedures and the ability to delete their information. The advancement of dynamic generation over time and at large product volumes creates three main technical challenges: modeling temporal relationships, maintaining model quality, and developing realistic virtual try-on through 3D body modeling and garment dynamics. These models must address output bias issues while preserving cultural sensitivity for global markets and enabling efficient deployment on edge devices.
Future Directions: The advancement of this field requires interdisciplinary collaboration to achieve three main goals: high-fidelity generation, robust bias detection with efficient personalization, and ethical guidelines for generative fashion. AI research should focus on building inclusive systems that promote diversity while respecting cultural norms and user privacy to achieve maximum societal benefits in fashion and garment design.

4. Simulative Fashion

Simulative fashion (Figure 8a) describes the process of using computational modeling and simulation techniques to create virtual garments and accessories through design and prototype testing. This approach allows creators to digitally experiment with intricate garment designs, materials, and fit without costly physical prototyping. With the exponential growth of computational power, simulative fashion is assuming an increasingly significant role in the design process. The Conditional Analogy GAN developed by Jetchev et al. [27] demonstrated early deep learning applications for fashion article swapping and virtual try-on systems. With the advent of immersive VR/AR technologies [94,95,96], simulative fashion is increasingly experienced in interactive, photorealistic 3D virtual environments. The software CLO3D (https://www.clo3d.com) allows designers and customers to visualize and try on custom outfits in digital fittings before manufacturing, as reported by Prasetya et al. [97]. Real-time garment editing and advanced cloth simulation further streamline the creative process. The following section examines two essential features of simulative fashion: virtual try-on capabilities and cloth simulation. These components play a major role in making simulative fashion transformative by allowing designers to visualize and evaluate their designs in virtual environments. The integration of physics-based simulation with real-time rendering capabilities has reached a level of sophistication where virtual garments exhibit realistic draping, stretch, and movement behaviors that closely mirror their physical counterparts. Furthermore, the convergence of artificial intelligence with cloth simulation is opening new possibilities for automated pattern generation, fabric property prediction, and intelligent design assistance that can guide creators toward more efficient and aesthetically pleasing solutions.

4.1. Virtual Try-On

Virtual Try-On (VTO) technology [98,99,100,101,102,103,104,105] has experienced rapid advances in recent years thanks to deep learning techniques. VTO allows users to virtually “try on” garment items and assess fit without physically trying on the garments. This technology has numerous potential applications in online shopping, gaming, social media, and beyond. In recent years, a multitude of deep learning techniques have been introduced for virtual try-on applications. This section provides a comprehensive overview of these methods and compares their architectures, inputs, techniques, advantages, limitations, and performance. The development of virtual try-on technology has been enabled by the release of large-scale datasets (Table 7) to train data-intensive models. Early works such as DeepFashion [32] released datasets containing over 800,000 images with rich annotations to perform tasks such as garment recognition and retrieval. VITON [2] introduced a dataset specifically for virtual try-on with 14,000 image pairs, while CP-VTON [43] provided additional benchmarks for characteristic-preserving virtual try-on with similar dataset sizes. The field evolved from smaller datasets that included 10,000–30,000 examples with detailed annotations. The VITON [2] dataset consists of 14,000 image pairs that utilize human parsing maps to generate clothing-agnostic representations. CP-VTON [43] used similar dataset sizes to maintain characteristic preservation through enhanced geometric matching techniques.
The flow-based approach of ClothFlow [106] generated clothed persons, while ACGPN [40] incorporated attention mechanisms together with semantic layouts. This research focused on developing geometric understanding capabilities and methods to preserve garment details. The VITON-HD [3] system achieved higher resolution while maintaining quality standards, and PF-AFN [107] eliminated the need for human parsing through parser-free methods. The evaluation metrics evolved from pixel-level assessments to structural similarity indices, perceptual metrics, and user studies for realistic virtual try-on evaluation.

4.1.1. Image-Based Methods

Earlier works on virtual try-on focused on image-based methods. These techniques (Table 8) take a garment image and a person’s image as input and synthesize a new image showing the garment worn by the person. Early approaches used a conditional generative adversarial network (GAN) to generate realistic images conditioned on the garment and person images, following an analogy-based approach for fashion article swapping [27]. The subsequent work of VITON [2] presented a more advanced method through clothing-agnostic person representation and a coarse-to-fine generation strategy. The generator uses an encoder–decoder architecture with attention mechanisms to focus on important regions during the try-on process. Virtual Try-On (VTO) technology has experienced major progress through several essential developments. Human parsing information [2,43,108] provides semantic guidance for realistic garment (Figure 8b) representations and precise placement. Attention mechanisms [40,109] refine image translation by focusing on essential areas while maintaining backgrounds (Table 8). The SPADE implementation [110] improved architectures using segmentation maps for higher-resolution outputs in semantic synthesis. Multi-stage coarse-to-fine frameworks [2,106] employ two-stage processes with low-resolution generation followed by refinement cycles for complex poses and scenes. Geometric matching strategies [40,43] provide improved body-garment alignment, addressing spatial deformations. Recent advancements include flow-based generation [106], adversarial training for occlusion handling [109], and attention-guided synthesis [40] for enhanced photorealism.
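The multi-stage coarse-to-fine idea can be illustrated with a highly simplified PyTorch sketch; the small convolutional blocks below are generic placeholders, not the published VITON or ClothFlow architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified illustration of a two-stage coarse-to-fine try-on generator:
# stage 1 produces a low-resolution try-on estimate from the person and garment images;
# stage 2 refines the upsampled estimate with a residual correction.
# The tiny conv blocks are placeholders for the published encoder-decoder networks.
def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

class CoarseToFineTryOn(nn.Module):
    def __init__(self):
        super().__init__()
        self.coarse = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 3, 1))
        self.refine = nn.Sequential(conv_block(9, 32), nn.Conv2d(32, 3, 1))

    def forward(self, person, garment):
        # person, garment: (batch, 3, H, W) full-resolution inputs
        low_p = F.interpolate(person, scale_factor=0.25)
        low_g = F.interpolate(garment, scale_factor=0.25)
        coarse = self.coarse(torch.cat([low_p, low_g], dim=1))       # low-res try-on estimate
        coarse_up = F.interpolate(coarse, size=person.shape[-2:])    # upsample to full size
        residual = self.refine(torch.cat([coarse_up, person, garment], dim=1))
        return torch.tanh(coarse_up + residual)                      # refined try-on image

out = CoarseToFineTryOn()(torch.randn(1, 3, 256, 192), torch.randn(1, 3, 256, 192))
```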

4.1.2. Video-Based Methods

Video-based VTO methods [111,112,113,114,115,116,117,118] overcome image-based limitations by modeling temporal information and complex poses. The FW-GAN model introduced flow-navigated warping with optical flow guidance for temporal consistency; however, it exhibited misalignments due to its reliance on external pose estimation. ClothFormer [119] improved results through appearance-flow tracking and optical flow correction for spatio-temporal consistency (Table 9), using transformer-based generators with TPS and appearance-based warping; however, it requires high computational resources. VIBE [120] reconstructed 3D human motion from monocular video using temporal discriminators; however, it was limited to controlled environments. Research in 2024-2025 focused on diffusion methods, including VITON-DiT [118] using Diffusion Transformers, RealVVT [116] with dual U-Net architectures, SwiftTry [115] with ShiftCaching, and DPIDM [114] with dynamic pose interaction, achieving better spatio-temporal consistency than GAN-based methods.
Modeling cloth dynamics directly from 2D inputs works best for unconstrained videos. MV-TON [122] applied optical flow for frame propagation through recurrent generators and memory refinement to achieve better results on dynamic scenes by estimating garment-to-person flow between frames; however, it experienced quality degradation with large motion due to flow estimation errors. Two-stage frameworks show promise: GPD-VVTO [123] proposed Latent Diffusion Models that decompose video try-on into realistic single-frame generation and temporal coherence maintenance (Table 9), integrating texture and semantic features with temporal attention mechanisms. DreamVVT [124] introduced stage-wise diffusion transformer frameworks, sampling keyframes with multi-frame try-on models and vision-language integration, using keyframes as appearance guidance for video generation with LoRA adapters for temporal coherence; however, it requires significant computational power to achieve state-of-the-art quality. The quantitative evaluation metrics for image-based virtual try-on methods appear in Table 10. Pixel-level similarity assessment through SSIM and MSE shows better preservation when SSIM values increase and MSE values decrease. The evaluation of perceptual realism through LPIPS and FID scores demonstrates that lower scores indicate better similarity to real samples. The early VITON [2] model used pose guidance and a coarse-to-fine strategy to achieve reasonable SSIM/MSE; however, it exhibited higher LPIPS/FID scores.
CP-VTON [43] achieved better results through its Geometric Matching Module with thin-plate spline transformation and Try-On Module with composition masks, which led to significant improvements in all metrics and garment characteristic preservation. The state of the art advanced through SP-VITON [125] and MG-VTON [126] by implementing DensePose-based shape estimation and multi-pose guided try-on with conditional parsing networks and Warp-GAN architectures. The highest metrics were achieved by SwiftTry [115] through its conversion of virtual try-on into conditional video inpainting using diffusion models with temporal attention layers.
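To make these pixel-level criteria concrete, the following minimal Python sketch computes SSIM and MSE for a generated try-on image against its reference; the function name is illustrative, and perceptual metrics such as LPIPS and FID are only noted in comments because they additionally require pretrained networks and are computed over full test sets.

```python
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

def try_on_pixel_metrics(generated: np.ndarray, reference: np.ndarray):
    """Return (SSIM, MSE) for two aligned H x W x 3 uint8 images.

    Higher SSIM and lower MSE indicate better preservation of the reference.
    """
    # channel_axis=-1 is the recent scikit-image argument; older releases use multichannel=True
    ssim = structural_similarity(generated, reference, channel_axis=-1)
    mse = mean_squared_error(generated, reference)
    return ssim, mse

# Perceptual metrics (LPIPS, FID) need pretrained feature extractors, e.g. the
# 'lpips' package or torchmetrics' FrechetInceptionDistance, and are evaluated
# over the whole test set rather than a single image pair.
```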

4.1.3. 3D Model-Based Methods

Advanced VTO requires the combination of 3D person and garment models with simulation capabilities to fit garments onto models and generate renderings from specific viewpoints for managing intricate poses, viewing angles, and occlusions. Model-based works such as DrapeNet [127] and GarNet [128] employed unsigned distance fields with self-supervised draping for garment generation and two-stream networks for fast 3D cloth draping, respectively. The latest techniques employ graph convolution networks [129,130] with physics-based neural simulation and synthetic datasets such as CLOTH3D [34] for advanced material modeling. Multi-Garment Net [131] predicts body shape and garment layers on SMPL models from video frames to address multi-layered garment challenges. The requirement for precise 3D data and high computational complexity remains a challenge; however, the situation is improving as 3D data becomes more accessible. Tela [132] expanded text-to-3D garment generation by producing 3D clothed humans from text descriptions for natural language avatar creation. The method’s ability to reconstruct 3D human pose and shape is evaluated through SMPL error and pose error measurements (Figure 9a). Lower errors indicate better fitting of the estimated 3D body model to ground truth. The DrapeNet [127] model, which uses unsigned distance fields and self-supervised learning to generate and drape garments, achieved reasonable yet relatively high errors of 7.2 mm and 9.3 degrees. The authors of GarNet [128] developed a two-stream network architecture that processes both garment and body features to reduce SMPL and pose errors through physics-inspired loss functions. Multi-Garment Net [131] advanced the state of the art by learning to dress 3D people with multiple garments from images, obtaining smaller fitting discrepancies through more sophisticated modeling of garment–body interactions and layered garment representations. CLOTH3D [34] achieved the best results with 4.3 mm and 6.5 degrees, respectively, using a large-scale synthetic dataset with realistic cloth dynamics and physics-based simulation. Figure 9b,c shows quantitative metrics for draping and wrinkle generation in virtual try-on methods. The Chamfer Distance (CD) calculates the average distance between predicted and ground truth 3D garment meshes to assess large-scale fold simulation. The Hausdorff Distance measures the maximum local errors, which reveal the areas where simulation results fail. The lower the score, the better the geometric conformity. Mean Squared Error (MSE) compares generated wrinkle patterns to reference deformations, quantifying realistic synthesis of fine-scale creases (Figure 9d). DrapeNet [127] achieved baseline performance (12.1 cm CD, 3.4 cm Hausdorff) using unsigned distance fields. The dual-stream architecture of GarNet [128] enabled better accuracy by processing both garment and body features. Multi-Garment Net [131] processed multiple garments simultaneously, while CLOTH3D [34] utilized extensive synthetic data. Neural Cloth Simulation [129] produced the most effective results through its unsupervised physics-inspired learning approach, which separated static and dynamic subspaces to generate realistic wrinkles.
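As a hedged illustration of the geometric metrics discussed above, the sketch below computes a symmetric Chamfer Distance and the Hausdorff Distance between points sampled from a predicted and a ground-truth garment mesh; the averaging convention for the Chamfer Distance varies across papers, and the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_hausdorff(pred_pts: np.ndarray, gt_pts: np.ndarray):
    """Distances between two point sets of shape (N, 3) and (M, 3) sampled from garment meshes."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)    # nearest ground-truth point for each prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)    # nearest predicted point for each ground-truth point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()      # one common symmetric convention
    hausdorff = max(d_pred_to_gt.max(), d_gt_to_pred.max())  # worst-case local deviation
    return chamfer, hausdorff
```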

4.2. Cloth Simulation

Deep learning shows promise for addressing the intricate problem of cloth simulation. Traditional physics-based methods require explicit modeling of complex nonlinear dynamics, which proves to be computationally expensive. Deep neural networks offer an alternative data-driven approach that can directly learn mappings from input states to output deformations from large datasets of cloth motion.

4.2.1. Physics-Based Modeling

Physics-based modeling techniques (Table 11) enable digital garment simulation and animation of realistic dynamics and interactions between virtual objects and phenomena. The accurate modeling of complex physical behaviors, including rigid body dynamics, fluid flows, deformable solids, and their coupling, continues to be an active area of research. The following section examines the evolution of physics-based modeling over the past seven years, with an emphasis on techniques suitable for game development, Virtual Reality applications, and visual effects creation. Position Based Dynamics (PBD) was introduced by Müller et al. [133] as an alternative to penalty-force-based implicit integration for efficiently simulating deformable objects by directly modifying particle positions each timestep to satisfy constraints using iterative projection. The method eliminates stability issues while preserving the performance-accuracy tradeoff of full implicit solvers. In PBD, cloth is represented by $n$ particles with positions $p_i \in \mathbb{R}^3$ and masses $m_i$. The simulation proceeds by first predicting positions without constraints ($p_i^{*} = p_i + \Delta t \cdot v_i$), then iteratively projecting positions onto constraint manifolds. For each constraint $C_j(p) = 0$, the position correction is computed as:
$$\Delta p_i = - s \cdot \frac{w_i \, \nabla_{p_i} C_j}{\sum_k w_k \, \lVert \nabla_{p_k} C_j \rVert^{2}}$$
where $s = C_j(p)$ is the constraint violation, $w_k = 1/m_k$ are inverse masses, and $\nabla_{p_i} C_j$ is the constraint gradient. Common cloth constraints include distance constraints ($\lVert p_i - p_j \rVert - L_0 = 0$) to preserve edge lengths and bending constraints ($\cos\theta - \cos\theta_0 = 0$) to resist folding at dihedral angles $\theta$ between adjacent triangles. The method’s unconditional stability makes it ideal for real-time applications where consistent frame rates are critical. Several follow-up works improved PBD stability for complex scenarios. Extended Position-Based Dynamics (XPBD) was developed by Macklin et al. [134] to address iteration-dependent stiffness behavior in the original PBD method by introducing a compliance formulation that provides timestep-independent material properties. XPBD reformulates constraints using compliance parameters $\alpha = 1/k$, where $k$ is material stiffness. The constraint solving incorporates Lagrange multipliers $\lambda$ with updates:
$$\Delta \lambda = \frac{- C_j - \tilde{\alpha}\,\lambda}{\sum_k w_k \, \lVert \nabla_{p_k} C_j \rVert^{2} + \tilde{\alpha}}$$
where $\tilde{\alpha} = \alpha / \Delta t^{2}$ scales compliance to the per-timestep domain, and the term $\tilde{\alpha}\lambda$ provides implicit damping. This formulation ensures that stiffness $k$ remains constant across different timesteps and iteration counts, enabling physically consistent cloth behavior regardless of simulation frame rate. Jan et al. [135] conducted a thorough survey of position-based simulation methods, which included developments in constraint handling and solver improvements. PBD remains widely used due to its interactive performance and stability even for large simulations. Material Point Methods (MPMs) were introduced to computer graphics by Stomakhin et al. [136] as a hybrid method that represents continuum materials using Lagrangian particles carrying state information while leveraging Eulerian background grids for spatial operations. MPMs outperform pure Eulerian and Lagrangian methods for nonlinear effects such as large deformations and topological changes. The algorithm alternates between particle and grid representations each timestep. In the Particle-to-Grid (P2G) transfer, particle mass and momentum are projected to grid nodes using interpolation weights $w_{ip}$:
$$m_i = \sum_p w_{ip}\, m_p, \qquad (mv)_i = \sum_p w_{ip} \left( m_p v_p + f_p \, \Delta t \right)$$
where subscripts $i$ and $p$ denote grid nodes and particles, respectively. After solving momentum equations on the grid, the Grid-to-Particle (G2P) transfer updates particle velocities and positions:
$$v_p^{\,n+1} = \sum_i w_{ip}\, v_i^{\,n+1}, \qquad x_p^{\,n+1} = x_p^{\,n} + \Delta t \, v_p^{\,n+1}$$
The deformation gradient $F_p$ tracks material deformation for constitutive modeling:
$$F_p^{\,n+1} = \left( I + \Delta t \sum_i v_i^{\,n+1} \, \nabla w_{ip}^{\top} \right) F_p^{\,n}$$
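To make the particle-grid transfers concrete, the toy Python sketch below runs one MPM-style step in 1D with linear interpolation weights and gravity only; the grid size, timestep, and particle values are arbitrary assumptions, and stress-based internal forces via the deformation gradient are omitted for brevity.

```python
import numpy as np

# Toy 1D particle-to-grid / grid-to-particle transfer with linear "hat" weights.
# Parameters and particle states below are arbitrary illustrative values.
dx, n_grid, dt, gravity = 0.1, 16, 1e-3, -9.81
xp = np.array([0.42, 0.55, 0.87])       # particle positions
vp = np.zeros_like(xp)                  # particle velocities
mp = np.full_like(xp, 0.01)             # particle masses

def mpm_step(xp, vp):
    m_grid = np.zeros(n_grid)
    mv_grid = np.zeros(n_grid)
    base = np.floor(xp / dx).astype(int)
    frac = xp / dx - base
    for offset, w in ((0, 1.0 - frac), (1, frac)):      # interpolation weights w_ip
        np.add.at(m_grid, base + offset, w * mp)         # P2G: mass
        np.add.at(mv_grid, base + offset, w * mp * vp)   # P2G: momentum
    v_grid = np.divide(mv_grid, m_grid, out=np.zeros_like(mv_grid), where=m_grid > 0)
    v_grid += dt * gravity                               # grid momentum update (gravity only here)
    vp_new = np.zeros_like(vp)
    for offset, w in ((0, 1.0 - frac), (1, frac)):       # G2P: gather updated grid velocities
        vp_new += w * v_grid[base + offset]
    return xp + dt * vp_new, vp_new                      # advect particles with the new velocities

xp, vp = mpm_step(xp, vp)
```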
Jiang et al. [137] demonstrated MPM’s capabilities for diverse graphics applications (Table 12), while Guo et al. [138] extended it to thin shells with frictional contact handling, and Hu et al. [139] enhanced performance through moving least squares (MLS-MPM) and GPU acceleration, achieving real-time simulation of complex scenarios with millions of particles. Finite Element Methods (FEMs) provide the most accurate continuum mechanics formulation through variational principles, but traditionally sacrifice real-time interactivity due to computational demands. FEMs discretize cloth into triangular or quadrilateral elements and solve the weak form of the equilibrium equations. For a cloth with strain energy density $\Psi(F)$, where $F$ is the deformation gradient, the internal force on vertex $i$ is:
$$f_i^{\mathrm{int}} = -\frac{\partial \Psi}{\partial x_i} = -\sum_{\mathrm{elements}} V_e \, \frac{\partial \Psi}{\partial F} : \frac{\partial F}{\partial x_i}$$
where $V_e$ is element volume and the colon denotes tensor contraction. Time integration requires solving large sparse linear systems at each timestep, traditionally limiting FEMs to offline cloth simulation.
Table 11. Comparison of Physics-based Techniques.
Technique | Fidelity | Stability | Interactivity | Scalability
PBD [133] | Medium | High | High | Medium
MPM [137] | High | Medium | Medium | High
FEM [140] | High | Medium | Low | Low
XPBD [134] | High | High | Medium | Medium
Cloth MPM [140] | Medium | High | High | Medium
Hybrid Methods [138] | High | Medium | Medium | High
Table 12. Quantitative Comparison of Physics-based Techniques.
Paper | Year | Runtime (ms) | Particles/Elements | Application
[133] | 2007 | 16 | 5 K particles | Cloth
[136] | 2013 | 80 | 1 M particles | Snow
[134] | 2016 | 25 | 10 K particles | Cloth
[137] | 2016 | 120 | 2 M particles | General MPM
[139] | 2018 | 45 | 500 K particles | Fluids
[138] | 2018 | 95 | 100 K particles | Thin shells
However, Georgescu et al. [141] achieved significant GPU acceleration through parallel matrix assembly and iterative solvers, while He et al. [142] presented fully real-time GPU-based FEMs using heterogeneous computing architectures that distribute workload across CPU and GPU resources. Variational Integrators, introduced by Marsden et al. [143], derive time integration schemes directly from variational principles of Lagrangian mechanics, automatically preserving geometric properties such as momentum conservation and symplectic structure. For a cloth system with Lagrangian $L(q, \dot{q}) = T - V$ (kinetic minus potential energy), the discrete Euler–Lagrange equations become:
$$D_2 L_d(q_k, q_{k+1}) + D_1 L_d(q_{k+1}, q_{k+2}) = 0$$
where $L_d$ is the discrete Lagrangian approximating the action integral over one timestep, and $D_1$, $D_2$ denote partial derivatives with respect to first and second arguments. This formulation guarantees long-term energy stability without artificial damping, making variational integrators particularly suitable for cloth animation in feature films requiring extended simulations. Fang et al. [144] enhanced material point methods with affine projection stabilizers to improve efficiency and reduce numerical dissipation in hyperelastic simulations, demonstrating up to 40% reduction in computation time while maintaining accuracy. Modeling the intricate coupling between phenomena demands hybrid approaches. Fei et al. [145] used consistent velocity transfers to couple deformable solids with fluids between MPMs and traditional FEMs, ensuring momentum conservation at material interfaces. The field evolved toward holistic multiphysics modeling through versatile coupling techniques that preserve physical consistency across different material representations. The development of new methods has made it possible to perform interactive editing and authoring of high-fidelity cloth simulations in real-time while maintaining physical accuracy.
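Returning to the PBD and XPBD updates introduced earlier in this subsection, the following minimal Python sketch applies Gauss-Seidel projection of distance constraints on a particle system; it is a CPU illustration under assumed data layouts (edge list, rest lengths, inverse masses), not any of the cited implementations, and the XPBD compliance term is only indicated in a comment.

```python
import numpy as np

def project_distance_constraints(p, inv_mass, edges, rest_len, iterations=10):
    """Gauss-Seidel projection of PBD distance constraints C = |p_i - p_j| - L0.

    p: (N, 3) predicted positions, inv_mass: (N,), edges: list of (i, j), rest_len: list of L0.
    """
    for _ in range(iterations):
        for (i, j), L0 in zip(edges, rest_len):
            d = p[i] - p[j]
            dist = np.linalg.norm(d)
            w_sum = inv_mass[i] + inv_mass[j]
            if dist < 1e-9 or w_sum == 0.0:
                continue
            C = dist - L0              # constraint violation s
            n = d / dist               # direction of the constraint gradient
            # position corrections weighted by inverse masses, as in the PBD equation above;
            # XPBD would additionally accumulate a Lagrange multiplier and add alpha/dt^2 terms
            p[i] -= inv_mass[i] / w_sum * C * n
            p[j] += inv_mass[j] / w_sum * C * n
    return p
```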

4.2.2. Real-Time Simulation

The development of immersive technologies requires cloth simulation as an essential component for achieving realistic virtual try-ons, telepresence, and mixed-reality applications. Real-time cloth simulation must satisfy strict temporal constraints, maintaining stable frame rates (typically 60–90 FPS for VR) while preserving visual plausibility, which necessitates algorithmic optimizations and hardware acceleration strategies distinct from offline simulation approaches. Mass-spring models represent cloth as a network of point masses connected by springs, offering computational simplicity amenable to parallel implementation. For a cloth with n vertices, each vertex i experiences spring forces from connected neighbors, with the restoring force proportional to displacement from rest length. Va et al. [146] implemented this model through Unity3D (6000.2.6f2) compute shaders on mobile GPUs, achieving stable 60+ FPS performance on moderate mesh resolutions (50 × 50 vertices) through adaptive constraint relaxation. The parallel force computation distributes vertex updates across GPU threads, with synchronization occurring only at integration boundaries. Their adaptive technique dynamically adjusts spring stiffness based on strain magnitude, preventing numerical instabilities from excessive stretching while maintaining real-time performance on mobile AR/VR devices. Kim et al. [147] conducted systematic performance comparisons between GPU-accelerated Position-Based Dynamics and Unity’s built-in cloth component on Meta Quest 3, demonstrating that custom PBD implementations achieve superior scalability for high-resolution meshes (64 × 64 vertices at 60 FPS). The key advantage stems from PBD’s parallel-friendly constraint solving, where each constraint can be processed independently with position corrections accumulated atomically across GPU threads. Unity’s cloth system, while stable, imposes mesh complexity limits (typically 32 × 32 vertices for standalone headsets) due to its CPU-centric architecture, whereas GPU-native PBD scales efficiently to higher resolutions through massive parallelization of constraint projection operations. Su et al. [148] achieved 1000–2000× speedup over CPU implementations through comprehensive GPU parallelization of all simulation stages: force computation, integration, collision detection, and response. Their pipeline integrates depth camera data for body tracking, enabling real-time garment fitting where cloth–body collisions are detected and resolved in parallel across thousands of cloth vertices using GPU-accelerated spatial hashing. This enables interactive virtual try-on scenarios where users can see garment behavior respond immediately to body movements, with the depth camera providing a dense point cloud representing the user’s body geometry. Li et al. [149] developed P-Cloth, which handles complex garment–garment and garment–body interactions through dynamic matrix assembly on GPU. Their method achieves 2–8 FPS on high-resolution meshes (1.65 M triangles) by formulating collision constraints as sparse linear systems solved iteratively on GPU. Matrix assembly exploits parallelism by partitioning spatial regions and building local stiffness contributions independently.
Schmitt et al. [150] developed hierarchical GPU surface sampling for multilevel simulation, employing adaptive refinement that allocates computation to visually important regions while coarsening invisible areas, enabling real-time handling of 230,000+ particles through multi-resolution hierarchy processing. Traditional continuous collision detection (CCD) checks for collisions along particle trajectories between timesteps, requiring expensive root-finding for intersection tests. Lan et al. [151] replaced CCD with non-distance barrier methods that use smooth potential functions to prevent penetration. The barrier energy function is inactive when primitives are far apart but produces increasing repulsive forces as distance decreases below an activation threshold. This approach eliminates discrete collision detection entirely, instead incorporating barrier forces during time integration. The method achieved order-of-magnitude speedups (10–50×) compared to CCD while maintaining collision-free results through adaptive time-stepping that ensures sufficiently small timesteps when barrier forces become significant. Sung et al. [152] investigated WebGPU’s capabilities for physics simulation, demonstrating that modern web APIs can execute complex cloth dynamics previously restricted to native applications. WebGPU achieved 60 FPS with 640K nodes through compute shader-based parallel integration and force computation distributed across GPU workgroups. In contrast, WebGL’s graphics-oriented API failed to maintain performance above 10K nodes due to lack of dedicated compute shader support. This advancement enables browser-based virtual try-on applications that require no software installation, democratizing access to interactive cloth simulation across devices. Sung et al. [152] further optimized WebGPU cloth simulation through workgroup-local memory usage and reduced bandwidth consumption, achieving consistent 90 FPS for VR web applications. The Virtual Dressmaker system developed by Yaakop et al. [153] employed a client–server architecture separating VR rendering (client) from physics simulation (server), connected via low-latency networking. This architecture enables lightweight VR headsets to deliver high-fidelity cloth simulation by offloading computation to powerful servers. Maintaining total latency below 20 ms, the perceptual threshold for VR presence, requires optimizing network transmission, simulation timestep, and rendering pipeline. The system uses 6DOF hand tracking for garment manipulation, with interaction forces computed as spring-damper systems between hand position and grasped cloth vertices. Kim et al. [147] extended this paradigm to modern standalone XR platforms (Meta Quest 3), incorporating environmental mesh collision detection that allows virtual garments to interact with real-world furniture scanned through the headset’s depth sensors, creating seamless mixed-reality try-on experiences. The performance of cloth simulation methods in real-time VR/AR applications is evaluated in Table 13. Memory optimization proves critical for resource-constrained mobile VR platforms. Adaptive level-of-detail (LOD) techniques dynamically adjust mesh resolution based on viewer distance and visual importance, with coarser meshes used for distant garments and finer meshes for close-up views. Temporal coherence methods exploit frame-to-frame similarity, reusing collision detection structures across timesteps and avoiding redundant computations.
When simulating multiple layered garments simultaneously, hierarchical simulation schedules updates according to visibility and motion activity, focusing computation on active, visible layers while updating static, occluded layers at reduced frequencies. These optimizations collectively enable mobile VR platforms to simulate multiple garment layers while maintaining the 72–90 FPS required for comfortable VR experiences. The emergence of neural-accelerated physics simulation shows promise for reducing computational overhead while preserving visual fidelity. Hybrid approaches train neural networks to predict coarse cloth motion based on body pose and garment type, then refine predictions with minimal physics-based correction iterations to ensure physical constraint satisfaction. This neural prediction provides good initial estimates that reduce the number of expensive physics iterations required, with early results suggesting potential for 5–10× speedups on mobile platforms. Such hybrid methods may enable complex multi-layer cloth simulation on standalone VR headsets without sacrificing visual quality, as the neural component handles smooth, predictable motion while physics corrections handle complex interactions and collisions.
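As a small illustration of the parallel-friendly force evaluation that these GPU implementations exploit, the sketch below evaluates Hooke spring forces for all cloth edges at once with NumPy; the scatter-add mirrors the atomic accumulation a compute shader would perform, and the stiffness value is an arbitrary assumption.

```python
import numpy as np

def spring_forces(pos, edges, rest_len, k=500.0):
    """Vectorized restoring forces for a mass-spring cloth.

    pos: (N, 3) vertex positions, edges: (E, 2) int indices, rest_len: (E,) rest lengths.
    """
    i, j = edges[:, 0], edges[:, 1]
    d = pos[i] - pos[j]                                   # edge vectors, shape (E, 3)
    length = np.linalg.norm(d, axis=1, keepdims=True)
    direction = d / np.maximum(length, 1e-9)
    f = -k * (length - rest_len[:, None]) * direction     # force proportional to stretch
    forces = np.zeros_like(pos)
    np.add.at(forces, i, f)                               # scatter-add, analogous to GPU atomic adds
    np.add.at(forces, j, -f)
    return forces
```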
Future Directions and Challenges. Cross-platform compatibility remains a significant challenge, as different VR/AR ecosystems (Meta Quest, Apple Vision Pro, PC VR) require tailored optimization strategies balancing performance with visual quality across diverse hardware configurations. Different platforms have varying computational capabilities, memory bandwidth, and thermal constraints that necessitate platform-specific tuning of simulation parameters and LOD strategies. Future developments in dedicated AI processing units (NPUs) integrated into XR chipsets, combined with specialized graphics architectures featuring hardware-accelerated physics primitives, are expected to further enhance real-time cloth simulation capabilities. These advances may enable photorealistic fabric behavior simulation including fine wrinkles, anisotropic stretching, and complex friction—in consumer-grade VR applications, closing the gap between real-time interactive and offline cinematic cloth simulation quality. The integration of machine learning-based material parameter estimation from captured fabric samples could also enable personalized virtual try-on experiences that accurately reproduce the specific draping characteristics of individual garment items.

4.3. Challenges and Future Outlook

The advancement of virtual try-on technology faces multiple significant challenges and unresolved problems. The task of synthesizing detailed information with complex patterns proves challenging when working with high-resolution outputs. The field requires advanced detail-generation approaches, as it struggles to model realistic garment physics, including wrinkles, folds, and dynamics. The implementation of physics engines with 3D cloth simulation helps address this problem. Virtual try-on systems raise major privacy concerns because they require access to personal biometric information, including body measurements, facial characteristics, and detailed 3D body scanning data [93]. The protection of personal data depends on users maintaining control over their information through clear consent procedures and options to remove stored data. The implementation of on-device processing and federated learning, together with differential privacy techniques, should be employed to protect user data through minimal exposure while preserving system functionality. Currently available datasets suffer from limited diversity in terms of body shapes, combined with restricted variety in poses, views, and garments. Training virtual try-on models requires extensive, diverse datasets; however, ethical data collection methods must be maintained during this process. VTO methods require standardized evaluation protocols together with suitable benchmarks, which are currently unavailable. The field requires rigorous quantitative assessments to monitor progress while ensuring fairness for users of all backgrounds [92]. The deployment of VTO models on resource-constrained mobile devices faces optimization challenges related to model size, latency, and throughput requirements. The accessibility of virtual try-on technology requires universal access across all devices and internet connections for equitable use. VTO systems must address body image effects and avoid supporting unattainable beauty standards. The technology must display diverse representations that include all body types as well as various cultural backgrounds, as noted by Buolamwini et al. [155]. Virtual try-on technology requires specific guidelines to prevent users from misusing these capabilities for unauthorized content generation. The field of VTO research focuses primarily on computer vision methods; however, future development will benefit from multidisciplinary approaches that combine vision with graphics simulation and ethical design principles. Future advancements will include multi-modal input capabilities through 3D data integration, along with advanced generative models for detail synthesis, physics-based simulation techniques, privacy-preserving architectures, domain-specific datasets, and ethical considerations. The future development of virtual try-on technology will advance through responsible AI development to transform consumer shopping behavior in virtual environments while protecting privacy and promoting positive body image.

5. Recommender Fashion

Fashion recommendation [156,157,158,159,160,161,162] aims to suggest fashionable and style-compatible outfits, accessories, or products to users based on their preferences. The expansion of e-commerce, together with computer vision and deep learning technologies, has led to the creation of multiple data-driven methods that enable personalized fashion recommendations at large scales. A chic ensemble requires perfect harmony between fashion elements, as demonstrated in Figure 10a. The objectives of fashion recommendation and fashion synthesis remain distinct despite their shared implementation of deep learning methods. Fashion synthesis produces realistic images and videos of novel fashion products through GANs, either from attributes or existing images. The primary goal is to create authentic fashion content. Fashion recommendation systems apply recommender system and neural network techniques to provide users with suitable products and outfit suggestions based on their preference data and browsing history (Figure 10b). The system learns user patterns from existing item catalogs to predict which additional fashion items users may prefer. The creation of new fashion item content occurs through synthesis; however, recommendation systems focus on matching users with suitable catalog options based on their inferred preferences. This section reviews and compares major deep learning methods for fashion recommendation proposed in recent years.

5.1. Outfit Learning Recommendation

Outfit recommendation [75,76,156,163,164,165,166] is an important application that aims to help users discover new fashion styles and create complementary ensembles. Compared to recommending individual garment items, outfit recommendation presents unique challenges due to the complex relationships between multiple coordinated pieces that constitute a visual style. Traditional content-based and collaborative filtering approaches fail to adequately capture the complex multi-factor aspects of outfit compatibility and harmony. In this section, we review deep learning models that represent and reason about visual fashion styles at the level of complete, multifaceted outfits rather than isolated garments. The goal of such systems is to understand how different garment and accessory combinations create harmonious and unified looks. We compare techniques for outfit feature learning, compatibility modeling, and the generation of personalized visual recommendations, as evaluated on public fashion datasets. As illustrated in Figure 11, research in fashion recommendation has evolved rapidly over the past decade. Figure 11a shows the shift from convolutional and recurrent architectures toward graph neural networks (GNNs) and Transformer-based models, which have become dominant due to their ability to model complex visual and contextual relationships. Figure 11b compares model performance across key fashion tasks, indicating that GNN and Transformer architectures consistently outperform earlier networks in classification, retrieval, and compatibility prediction. Figure 11c highlights the increasing use of multimodal datasets that integrate visual, textual, and attribute information, marking a trend toward richer and more context-aware recommendation systems.

5.1.1. Single-Item Recommenders

Traditional single-item fashion recommenders suggest individual garment articles using content-based or visual similarity approaches, retrieving similar items from input images or text and generating ordered relevance lists within categories. Table 14 shows influential fashion datasets (2015–2022) that advance computer vision and deep learning for fashion, including dataset dimensions, prediction tasks, and benchmark accuracy metrics. As shown in Figure 12, the performance of recommendation methods has improved significantly with the transition from traditional to deep learning architectures. Early approaches such as Matrix Factorization (MF) and Singular Value Decomposition (SVD) achieved moderate accuracy by modeling linear user–item relationships. In contrast, recent neural architectures including Neural Collaborative Filtering (NCF), Bidirectional LSTM, Graph Neural Networks (GNNs), and Transformers demonstrate substantial accuracy gains by capturing non-linear dependencies, sequential patterns, and contextual interactions, underscoring the evolution of recommendation systems toward more expressive and adaptive models. Earlier datasets from the 2010s focused on classification and compatibility with smaller scales, while later datasets increased significantly, enabling advanced tasks such as keypoint detection, segmentation, and outfit recommendation with millions of labeled images.
Dataset sizes [32,37,168] increased steadily with higher model accuracy (80–90%+) for classification and detection benchmarks. Recent larger datasets contain detailed annotations supporting pose estimation, aesthetic assessment, and landmark retrieval [168,170]. The improvements in accuracy are correlated with comprehensive annotations that enable sophisticated algorithms for applications ranging from product search to outfit compatibility [1], including multi-modal approaches with text data. The first deep learning fashion recommendation model emerged in [32] through the DeepFashion dataset and CNN-based methods for attribute recognition and item retrieval without RNN architectures. Early neural approaches demonstrated varying levels of success (Table 14). Han et al. [75] demonstrated that bidirectional LSTM networks perform well for fashion compatibility learning, proving that recurrent architectures are effective for modeling fashion item relationships. RNNs addressed sequential data and temporal dependencies effectively; however, their large-scale recommendation capabilities were limited by computational complexity and vanishing gradient problems. Deep learning models require well preprocessed images because fashion item detection necessitates extracting garments from complete images. The comprehensive annotation system developed by Zheng et al. [37] standardized detected items through polygon level annotations. The improvements in preprocessing techniques resulted in enhanced performance for garment retrieval and classification throughout fashion recommendation systems.
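As a hedged illustration of this detection-based preprocessing step, the sketch below crops candidate items with an off-the-shelf torchvision detector; the image path and score threshold are placeholders, the COCO label set is not garment-specific, and fine-tuning on a fashion dataset such as DeepFashion2 would normally precede deployment.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained detector (torchvision >= 0.13 uses the weights= argument; older versions use pretrained=True).
detector = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = transforms.ToTensor()(Image.open("outfit.jpg").convert("RGB"))  # placeholder path
with torch.no_grad():
    pred = detector([img])[0]            # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.7              # assumed confidence threshold
crops = [img[:, int(y1):int(y2), int(x1):int(x2)]   # per-item crops for downstream retrieval
         for (x1, y1, x2, y2) in pred["boxes"][keep]]
```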
Neural network approaches have become essential components of modern fashion recommendation systems, as shown in Table 15. These systems employ deep learning architectures to enable advanced applications such as visual search and personalized outfit recommendation. The analysis in previous sections focused on single-item recommendation approaches that studied fashion articles after detection and classification. The item detection and segmentation capabilities of Faster R-CNN [171] and Mask R-CNN [172] have established a foundation for content-based recommendation of similar individual garments. The DeepFashion framework [32] introduced deep learning applications in fashion by first detecting objects and recognizing attributes before making multi-piece recommendations. Traditional frameworks analyzed fashion products independently; however, they failed to understand how these items work together as coordinated ensembles. The individual item suggestion approach proved successful; however, it failed to recognize how different items work together to form complete outfits that meet both visual appeal and situational requirements, as noted by Cheng et al. [1]. The absence of complete outfit recommendation frameworks led researchers to develop personalized outfit generation systems at Alibaba iFashion and comprehensive outfit compatibility models, which will be examined in the following sections. Modern single-item recommendation systems employ advanced deep learning methods to enhance the precision of individual item suggestions. Li et al. [173] demonstrate how deep neural networks improve single-item collaborative filtering by detecting intricate non-linear patterns in user–item interactions. Liu et al. [174] demonstrate that flow-based methods excel at single-item recommendations through their discrete flow frameworks, which maintain binary implicit feedback while producing excellent individual item suggestions. The chronological progression (Table 16) from 2015 to 2024 demonstrates the field’s advancement from basic bounding box detection (Faster R-CNN) to instance segmentation capabilities (Mask R-CNN), followed by the YOLO family’s continuous optimization for real-time performance through architectural innovations including backbone improvements (Darknet-53), anchor-free design, and specialized frameworks (PGI, GELAN). Early systems based on matrix factorization approaches [175] used collaborative filtering techniques to model user–item interactions and predict ratings for individual garments. However, they did not leverage other important contextual cues such as styling attributes, sequential data, and textual reviews that could enhance single-item recommendations. More recent deep learning methods helped address these limitations. The predicted preference r ^ u , i for user u on item i is given by:
$$\hat{r}_{u,i} = p_u^{\top} q_i,$$
where $p_u, q_i \in \mathbb{R}^{k}$ are the latent representations of user $u$ and item $i$, respectively. The objective minimizes the reconstruction error as:
$$\mathcal{L} = \sum_{(u,i) \in \mathcal{R}} \left( r_{u,i} - p_u^{\top} q_i \right)^{2} + \lambda \left( \lVert p_u \rVert^{2} + \lVert q_i \rVert^{2} \right).$$
For instance, Neural Collaborative Filtering (NCF) [176] incorporated neural architectures into collaborative filtering to capture non-linear user–item relationships. In Neural Collaborative Filtering (NCF), the inner product in matrix factorization is replaced by a non-linear neural interaction function:
$$\hat{y}_{u,i} = f_{\mathrm{NN}}\!\left( \left[ \, p_u \, ; \, q_i \, \right] \right),$$
where $f_{\mathrm{NN}}$ denotes a multilayer perceptron that captures complex user–item interactions, and $[\, p_u \, ; \, q_i \,]$ denotes the concatenation of the user and item embeddings. The model is optimized using binary cross-entropy:
$$\mathcal{L}_{\mathrm{NCF}} = - \sum_{(u,i)} \left[ \, y_{u,i} \log \hat{y}_{u,i} + \left( 1 - y_{u,i} \right) \log \left( 1 - \hat{y}_{u,i} \right) \right].$$
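A minimal PyTorch sketch of the NCF scoring function above is given below; the embedding size, MLP width, and class name are illustrative choices rather than the configuration of [176].

```python
import torch
import torch.nn as nn

class SimpleNCF(nn.Module):
    """Minimal neural collaborative filtering scorer: MLP over concatenated [p_u ; q_i]."""
    def __init__(self, n_users: int, n_items: int, dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)   # p_u
        self.item_emb = nn.Embedding(n_items, dim)   # q_i
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)    # y_hat in (0, 1)

# Training on implicit feedback with the binary cross-entropy loss L_NCF:
# model = SimpleNCF(n_users=10_000, n_items=5_000)
# loss = nn.BCELoss()(model(user_ids, item_ids), clicks.float())
```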
Recent advances include sophisticated neural networks that address traditional limitations [173] and flow-based models for improved interaction modeling [174].

5.1.2. Outfit Style Recommenders

Complete outfit recommendation systems analyze entire ensembles rather than individual pieces, proposing cohesive combinations that maintain overall aesthetic harmony. These systems must understand how different garment pieces work together as a unified whole. Generative Adversarial Networks (GANs) address data scarcity in fashion synthesis. Reed et al. [183] introduced conditional GANs for text-to-image generation. Zhu et al. [184] proposed CycleGAN for unpaired style transfer, while Zhu et al. [185] developed BicycleGAN for multimodal translation. Isola et al. [186] presented pix2pix for fashion design applications. The scalability of datasets and the alignment of user preferences remain challenging for GANs despite their ability to address cold-start issues. Physical attribute-based systems leverage body measurements for personalization. Hsiao et al. [187] utilized computer vision for fashion compatibility analysis. Han et al. [75] designed bidirectional LSTM networks for modeling compatibility relationships. Liu et al. [32] provided annotations enabling attribute-based recommendations. The collection of physical data raises privacy concerns that require privacy-preserving methods to address. Virtual and Augmented Reality technologies enable immersive try-on experiences. Wang et al. [43] developed frameworks combining computer graphics with machine learning for realistic garment simulation. These systems address sizing and fit problems while offering interactive shopping experiences. Multimedia platforms drive dynamic outfit coordination research. Chen et al. [163] developed large-scale personalized outfit generation systems for production environments. These multimodal approaches incorporate temporal dynamics, contextual awareness, and cross-modal understanding for sophisticated personalized experiences.

5.2. Context-Aware Recommendation

Recommendation systems that utilize contextual information enhance suggestion precision by analyzing the environment in which users interact. The main difference between traditional recommendation systems and context-aware systems lies in their approach to user–item interactions: traditional systems focus on user preferences and item characteristics, whereas context-aware systems understand how time, location, and user situations affect recommendations. User preferences vary according to the context in which recommendations are displayed. The system uses contextual data to generate personalized recommendations at the right time, thereby enhancing user satisfaction. The user experience benefits from recommendations tailored to their needs through the integration of contextual dimensions including occasion, location, time, companions, task, weather, and device. The goal of context-aware recommendation systems is to provide customized solutions through their ability to accommodate events, weather conditions, and device usage patterns.

5.2.1. Occasion-Based Recommendation Systems

The development of context-aware and graph neural network-based recommendation frameworks experienced substantial growth between 2020 and 2025 due to expanding datasets and advanced network architectures. Gao et al. [188] conducted an extensive review of graph neural networks for recommender systems, demonstrating how GNNs handle complex user–item–context relationships. Wu et al. [189] developed a graph convolution machine for context-aware recommender systems, which demonstrates how graph neural networks can handle both high-order feature interactions and contextual dependencies. Recent developments in attention mechanisms show superior performance in context-aware recommendation systems. The Attentive Interaction Network (AIN) developed by Mei et al. [190] uses attention mechanisms to explicitly model context–user–item feature interactions for context-aware recommendations. The authors demonstrated that different contexts produce different effects on user preferences through attention mechanisms, which enable the modeling of these dynamic relationships. The AIN model includes three essential components: (1) context-aware feature interaction layers that describe how contexts affect user and item representations, (2) an attention network that determines the weight of various context–feature interactions, and (3) a prediction layer that combines the attended features to generate final recommendations. Xin et al. [191] presented Convolutional Factorization Machines (CFMs) for context-aware recommendation, which unite convolutional neural networks with factorization machines to analyze contextual data patterns at multiple levels. The research showed better results than conventional context-aware methods on multiple benchmark datasets, demonstrating that adding spatial and temporal context information leads to improved recommendation accuracy and relevance. Rashed et al. [192] created CARCA (Context and Attribute-aware Sequential Recommendation via Cross-Attention) to model user profile dynamics through dedicated multi-head self-attention blocks. While traditional recommendation methods use basic similarity calculations, CARCA employs cross-attention between profile items and target items to produce more precise recommendations. Real-world evaluations proved that attention-based contextual interaction modeling produces superior recommendation results.
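The sketch below gives a simplified, hedged version of attentive context weighting in this spirit: a small network scores each context embedding against the user-item pair, and the attended context adjusts the base preference score. It is not the exact AIN [190] or CARCA [192] architecture; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ContextAttentionScorer(nn.Module):
    """Toy context-aware scorer: attention over context embeddings conditioned on (user, item)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, user, item, contexts):
        # user, item: (B, d); contexts: (B, C, d) embeddings of occasion, time, location, ...
        B, C, d = contexts.shape
        ui = torch.cat([user, item], dim=-1).unsqueeze(1).expand(B, C, 2 * d)
        att = torch.softmax(self.score(torch.cat([ui, contexts], dim=-1)), dim=1)  # weight per context
        attended = (att * contexts).sum(dim=1)                    # attended context representation
        return (user * item).sum(-1) + (attended * item).sum(-1)  # base preference + contextual term
```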

5.2.2. Climate-Adaptive Fashion Recommendation

The integration of climate data into systems enables personalized garment suggestions through climate-adaptive fashion recommendation. The Weather-to-Garment (WoG) system developed by Liu et al. [193] enables automatic wardrobe selection from user wardrobes based on weather data (Table 17). The authors established a scoring function with three terms to model the relationship through garment attributes, which act as a mid-level bridge connecting low-level features with high-level weather categories. Chen et al. [194] introduced a personalized fashion recommendation system that uses multimodal attention networks to generate visual explanations through the combination of review information modeled by LSTM with visual features integrated into the generation process. The fashion image attention mechanism divides images into 49 small 7 × 7 regions for detailed preference modeling, which enhances user preference adaptation through the combination of textual and visual modalities. These adaptive systems demonstrate that environmental context integration leads to improved outfit appropriateness and user satisfaction. The reusable self-attention-based recommender system for fashion (AFRA) developed by Celikik et al. [195] demonstrates effective attention mechanism applications in real-world fashion recommendation scenarios through various interaction types with fashion entities. Ongoing research in this field shows promise for boosting both customer satisfaction and business performance for online retailers. The integration of climate data will enhance model precision and lead to more personalized fashion recommendations globally. The Aspect-Based Fashion Recommendation model with Attention Mechanism (AFRAM) developed by Li et al. [196] uses online reviews of fashion products to forecast customer ratings. The model employs parallel CNN and LSTM attention paths to extract user and item latent aspects, which addresses the challenge of combining local and global aspect representations from customer feedback. Ahmed et al. [197] developed a deep transfer learning system to classify air temperatures through human clothing images, demonstrating how garment selection reveals environmental conditions. The authors achieved 98.13% temperature classification accuracy through their deep transfer learning methods, including ViT, ResNet101, and DenseNet121, thus proving the strong relationship between garment choices and weather conditions that climate-adaptive fashion systems can utilize.

5.3. Open Challenges and Future Research Directions

The application of deep learning methods to fashion recommendations has shown significant progress; however, multiple obstacles remain. The field of fashion recommendation faces a major obstacle due to the lack of sufficient large-scale public datasets, which are abundant in computer vision. Model generalization ability suffers from this constraint, particularly when attempting to detect rare fashion items in the market. Current models experience difficulties when generating time-based recommendations using implicit user feedback or tracking changing fashion trends. Fashion recommendation systems risk perpetuating biases from training data, which leads to the reinforcement of stereotypes about gender, body type, age, and cultural background. These systems may create user filter bubbles that restrict users from discovering diverse fashion styles and may also discriminate against specific demographic groups. Addressing algorithmic fairness requires proper dataset curation alongside bias detection tools and fairness-based recommendation algorithms that will enhance fashion discovery for all users. The collection of user data by recommendation systems includes extensive information about browsing activities, purchase records, body measurements, and personal taste preferences. The privacy requirements of sensitive information call for strong protection measures that include differential privacy and federated learning [198]. Users require transparent opt-out options and complete understanding of how their data influences recommendation results to maintain control over their information. The integration of contextual aspects, including occasion and season, demands better comprehension of common-sense knowledge. Current models face difficulties with new users and items due to cold start problems and struggle to maintain diverse recommendation outputs. The system must explain recommendation decisions to users because this transparency is necessary for building user trust and maintaining system accountability [199].
Fashion recommendation systems must honor cultural norms and avoid imposing Western beauty standards as a global recommendation framework. Adapting recommendations for diverse cultures demands an understanding of different fashion customs and prevention of cultural appropriation in recommendation systems. Addressing these challenges requires adopting lifelong learning approaches that simultaneously expand knowledge while protecting user privacy. Natural multimodal interaction through language, visuals, and conversational agents provides opportunities to create personalized and interpretable recommendations. The research community should develop recommendation systems that are bias-aware and employ privacy-preserving techniques together with culturally sensitive approaches. The responsible potential of deep learning in fashion recommendation reaches its peak when data scarcity is overcome and models demonstrate human-like perception and fashion-seeking behavior without violating ethical standards.

6. Open-Source Resources and Pre-Trained Models

To facilitate reproducibility and practical implementation, we summarize publicly available code repositories (Table 18), pre-trained models, and datasets that have significantly contributed to fashion computer vision research. This compilation enables researchers to build upon existing work efficiently, accelerating progress in the field while ensuring fair comparison across methods. We prioritize resources that provide both source code and pre-trained weights, enabling direct application and transfer learning for custom fashion datasets.

6.1. Key Datasets with Public Access

6.2. Pre-Trained Models for Transfer Learning

Researchers can leverage powerful vision foundation models, including ResNet-50, EfficientNet, and Vision Transformers (ViT), as backbone feature extractors for a wide range of fashion understanding tasks. These architectures provide hierarchical and semantically rich representations that capture garment texture, silhouette, color composition, and fine-grained attributes. Pre-trained weights on large-scale datasets such as ImageNet serve as strong initialization points, enabling rapid convergence and improved generalization when fine-tuned on fashion-specific datasets (e.g., DeepFashion, Fashion-MNIST, or bespoke e-commerce collections). This transfer learning pipeline significantly reduces the computational cost, data requirements, and training time while maintaining high performance even in low-resource scenarios. For generative modeling, StyleGAN2-ADA offers substantial advantages for fashion synthesis. Through adaptive discriminator augmentation, the model stabilizes training under extreme data scarcity and prevents discriminator overfitting. As a result, researchers can train high-quality garment generators with only a few hundred images, enabling applications such as new clothing design, virtual try-on garment synthesis, and controllable fashion editing. The ability of StyleGAN2-ADA to generalize from small datasets makes it particularly suitable for niche product categories, small brands, or historical fashion collections where large curated datasets are unavailable.
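A minimal sketch of the backbone fine-tuning pipeline described above is shown below, assuming torchvision 0.13 or later for the weights argument; the number of output classes is a placeholder for a fashion-specific label set.

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-50 adapted to a fashion attribute/category classifier.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False                       # freeze the backbone for low-resource fine-tuning
backbone.fc = nn.Linear(backbone.fc.in_features, 50)  # placeholder: 50 garment categories
# Train only backbone.fc (or additionally unfreeze layer4) with a standard cross-entropy loop.
```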

6.3. Computational Requirements

Most state-of-the-art models require high-end GPUs (NVIDIA A100/V100 with 32–40 GB VRAM) for training from scratch. However, fine-tuning pre-trained models can be performed on mid-range GPUs (RTX 3080/4090 with 10–24 GB VRAM). Inference typically runs efficiently on consumer hardware. Cloud platforms including Google Colab Pro (V100/A100 access), AWS SageMaker (p4d instances with 8× A100), and Paperspace Gradient provide accessible alternatives for researchers with limited computational resources. Free-tier options like Kaggle Notebooks offer 30 h/week of T4 GPU access suitable for experimentation and fine-tuning.

7. Cross-Domain Integration and Synergies

The three primary domains (generative fashion, simulative fashion, and recommender systems) increasingly converge to create integrated fashion technology ecosystems. This section examines how these domains intersect, reinforce one another, and enable novel applications through their synergistic combination.

7.1. Generative-Simulative Integration

Modern virtual try-on systems exemplify the convergence of generative and simulative approaches. Early methods relied purely on GANs for image synthesis [2,43], but often produced physically implausible artifacts such as unrealistic wrinkles or garment–body interpenetration. Recent advances incorporate physics-based simulation to ensure realistic draping and collision handling [129]. ClothFormer [119] demonstrates this hybrid paradigm: transformer-based generative models predict appearance flow for texture mapping, while physics-inspired constraints enforce mechanical plausibility. This combination achieves both visual realism (through learned appearance models) and physical correctness (through simulation constraints). Generative models also address simulation’s data scarcity problem. CLOTH3D [34] uses physics simulation to create synthetic training data (100K 3D garments), which then trains neural simulators [129] achieving real-time performance. This creates a virtuous cycle: simulation generates ground truth data, generative models learn efficient approximations, and resulting neural simulators enable interactive applications.

7.2. Generative-Recommender Integration

Traditional recommender systems retrieve existing inventory items, limiting suggestions to available products. Generative models overcome this constraint through synthetic outfit completion. When recommendation systems identify compatibility gaps (e.g., “this outfit needs a light blue cardigan”), generative models synthesize novel items meeting these requirements [26]. OutfitGAN demonstrates this capability by conditioning generation on existing outfit items through semantic alignment modules. User preference profiles from recommender systems guide personalized generation. The POG system at Alibaba iFashion [163] analyzes user interaction data (purchases, browsing) to extract style preferences, which then condition generative models for personalized outfit synthesis. This enables true personalization: different users viewing the same seed item receive distinct generated recommendations reflecting individual tastes. Both domains benefit from shared representation learning. Type-aware embeddings for compatibility prediction [76] guide generative models toward style-consistent outputs, while generative pre-training provides powerful feature extractors for recommendation tasks.

7.3. Simulative-Recommender Integration

Cloth simulation enhances context-aware recommendations. Weather-aware systems [193] suggest appropriate clothing based on forecasts, with integrated simulation showing garment behavior under specific conditions, for instance demonstrating how a dress moves in wind or how a coat performs in cold weather. This addresses the gap between static product images and real-world garment behavior. Body-aware recommendations leverage 3D body estimation from simulation. Multi-Garment Net [131] reconstructs detailed body shape from images, enabling simulation-based fit prediction. Recommender systems can simulate how candidate garments drape on the user’s specific body before recommendation, filtering poorly fitting items. This reduces sizing uncertainty and return rates in online fashion retail. Virtual try-on interaction data informs recommendations. Simulation systems track user behavior (viewing duration, zoom patterns, garment manipulation), revealing implicit preferences and concerns. Recommendation algorithms incorporate these signals to refine future suggestions.

7.4. Tri-Domain Integration

Emerging platforms integrate all three domains into unified ecosystems spanning the complete user journey:
  • Discovery (Recommender): Context-aware systems suggest occasions and style directions based on user preferences and contextual signals.
  • Customization (Generative): Users modify suggestions or generate personalized designs through text descriptions or style transfer.
  • Validation (Simulative): Virtual try-on with physics-based simulation confirms fit and appearance before purchase.
Vision-language models like CLIP provide unified embeddings useful across domains: text-to-image generation (generative), natural language search queries (recommender), and semantic garment understanding (simulative). This multi-domain representation learning reduces redundancy and enables knowledge transfer across tasks. Key opportunities include unified foundation models handling generation, simulation, and recommendation within single architectures; differentiable simulation enabling end-to-end training across domains; and multimodal fusion integrating text, images, 3D geometry, and user preferences holistically. Challenges include computational efficiency for real-time tri-domain processing, consistency across domains (ensuring generated items simulate realistically), and seamless user experience across discovery, customization, and validation workflows. The convergence of these domains represents the future of computational fashion, evolving from isolated tools toward integrated intelligent fashion assistants that understand, generate, simulate, and recommend in unified frameworks.
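As a hedged sketch of the shared vision-language embedding mentioned above, the snippet below ranks a small garment catalog against a natural-language query with the Hugging Face CLIP interface; the checkpoint name and image paths are placeholders, and the catalog would normally be pre-encoded offline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

catalog = [Image.open(p).convert("RGB") for p in ["dress_01.jpg", "cardigan_02.jpg"]]  # placeholder paths
inputs = processor(text=["a light blue cardigan for a spring office outfit"],
                   images=catalog, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
ranking = out.logits_per_text.softmax(dim=-1)   # relevance of each catalog image to the text query
```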

8. Conclusions

This systematic review examined recent progress across generative fashion modeling, simulation, and recommendations through comprehensive analysis of 200 studies from 2017 to 2025. Significant advances have been achieved in GAN-based and diffusion-based garment generation, physics-based cloth simulation achieving real-time performance, and context-aware recommendation systems. Despite this progress, challenges remain including limited dataset diversity, computational constraints, privacy concerns, and algorithmic biases. Future research should prioritize developing inclusive datasets, privacy-preserving architectures, culturally sensitive algorithms, and efficient models for edge devices. The convergence of generative AI, physics-based simulation, and intelligent recommendation systems holds immense potential for transforming the textile and fashion industry toward more sustainable, personalized, and accessible digital experiences.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info17010011/s1, PRISMA 2020 Checklist [200].

Author Contributions

Conceptualization, I.K. and S.E.A.; methodology, I.K.; software, I.K.; validation, I.K. and S.E.A.; formal analysis, I.K.; investigation, I.K.; resources, I.K.; data curation, I.K.; writing—original draft preparation, I.K.; writing—review and editing, S.E.A.; visualization, I.K.; supervision, S.E.A.; project administration, S.E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article as this is a systematic review of existing literature.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, W.H.; Song, S.; Chen, C.Y.; Hidayati, S.C.; Liu, J. Fashion meets computer vision: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 72. [Google Scholar] [CrossRef]
  2. Han, X.; Wu, Z.; Wu, Z.; Yu, R.; Davis, L.S. Viton: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7543–7552. [Google Scholar]
  3. Choi, S.; Park, S.; Lee, M.; Choo, J. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14131–14140. [Google Scholar]
  4. Santesteban, I.; Otaduy, M.A.; Casas, D. Learning-based animation of clothing for virtual try-on. Comput. Graph. Forum 2019, 38, 355–366. [Google Scholar] [CrossRef]
  5. Lahner, Z.; Cremers, D.; Tung, T. Deepwrinkles: Accurate and realistic clothing modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 667–684. [Google Scholar]
  6. He, T.; Hu, Y. FashionNet: Personalized outfit recommendation with deep neural network. arXiv 2018, arXiv:1810.02443. [Google Scholar] [CrossRef]
  7. Ding, Y.; Lai, Z.; Mok, P.; Chua, T.S. Computational technologies for fashion recommendation: A survey. ACM Comput. Surv. 2023, 56, 121. [Google Scholar] [CrossRef]
  8. Deldjoo, Y.; Nazary, F.; Ramisa, A.; Mcauley, J.; Pellegrini, G.; Bellogin, A.; Noia, T.D. A review of modern fashion recommender systems. ACM Comput. Surv. 2023, 56, 87. [Google Scholar] [CrossRef]
  9. Chakraborty, S.; Hoque, M.S.; Rahman Jeem, N.; Biswas, M.C.; Bardhan, D.; Lobaton, E. Fashion recommendation systems, models and methods: A review. Informatics 2021, 8, 49. [Google Scholar] [CrossRef]
  10. Mohammadi, S.O.; Kalhor, A. Smart fashion: A review of AI applications in the Fashion & Apparel Industry. arXiv 2021, arXiv:2111.00905. [Google Scholar] [CrossRef]
  11. Gong, W.; Khalid, L. Aesthetics, personalization and recommendation: A survey on deep learning in fashion. arXiv 2021, arXiv:2101.08301. [Google Scholar] [CrossRef]
  12. Liu, J.; Chen, Y.; Ni, B.; Yu, Z. Joint global and dynamic pseudo labeling for semi-supervised point cloud sequence segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5679–5691. [Google Scholar] [CrossRef]
  13. Ramos, L.; Rivas-Echeverría, F.; Pérez, A.G.; Casas, E. Artificial intelligence and sustainability in the fashion industry: A review from 2010 to 2022. SN Appl. Sci. 2023, 5, 387. [Google Scholar] [CrossRef]
  14. Ma, X.; Zhang, F.; Wei, H.; Xu, L. Deep learning method for makeup style transfer: A survey. Cogn. Robot. 2021, 1, 182–187. [Google Scholar] [CrossRef]
  15. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  16. Indolia, S.; Goswami, A.K.; Mishra, S.P.; Asopa, P. Conceptual understanding of convolutional neural network-a deep learning approach. Procedia Comput. Sci. 2018, 132, 679–688. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  18. Li, T.; Qian, R.; Dong, C.; Liu, S.; Yan, Q.; Zhu, W.; Lin, L. Beautygan: Instance-level facial makeup transfer with deep generative adversarial network. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 645–653. [Google Scholar]
  19. Chang, H.; Lu, J.; Yu, F.; Finkelstein, A. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 40–48. [Google Scholar]
  20. Jiang, W.; Liu, S.; Gao, C.; Cao, J.; He, R.; Feng, J.; Yan, S. Psgan: Pose and expression robust spatial-aware gan for customizable makeup transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5194–5202. [Google Scholar]
  21. Tugwell, P.; Tovey, D. PRISMA 2020. J. Clin. Epidemiol. 2021, 134, A5–A6. [Google Scholar] [CrossRef]
  22. Sarmiento, J.A. Exploiting latent codes: Interactive fashion product generation, similar image retrieval, and cross-category recommendation using variational autoencoders. arXiv 2020, arXiv:2009.01053. [Google Scholar] [CrossRef]
  23. Jeong, J.; Park, H.; Lee, Y.; Kang, J.; Chun, J. Developing parametric design fashion products using 3D printing technology. Fash. Text. 2021, 8, 22. [Google Scholar] [CrossRef]
  24. Zhu, Z.; Xu, Z.; You, A.; Bai, X. Semantically multi-modal image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5467–5476. [Google Scholar]
  25. Zhu, S.; Urtasun, R.; Fidler, S.; Lin, D.; Change Loy, C. Be your own prada: Fashion synthesis with structural coherence. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1680–1688. [Google Scholar]
  26. Chen, L.; Tian, J.; Li, G.; Wu, C.H.; King, E.K.; Chen, K.T.; Hsieh, S.H.; Xu, C. Tailorgan: Making user-defined fashion designs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 3241–3250. [Google Scholar]
  27. Jetchev, N.; Bergmann, U. The conditional analogy gan: Swapping fashion articles on people images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2287–2292. [Google Scholar]
  28. Xian, W.; Sangkloy, P.; Agrawal, V.; Raj, A.; Lu, J.; Fang, C.; Yu, F.; Hays, J. Texturegan: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8456–8465. [Google Scholar]
  29. Abdellaoui, S.; Kachbal, I. Apparel E-commerce background matting. Int. J. Adv. Res. Eng. Technol. (IJARET) 2021, 12, 421–429. [Google Scholar]
  30. El Abdellaoui, S.; Kachbal, I. Deep residual network for high-resolution background matting. Stud. Inf. Control 2021, 30, 51–59. [Google Scholar] [CrossRef]
  31. El Abdellaoui, S.; Kachbal, I. Deep background matting. In Proceedings of the International Conference on Smart City Applications, Castelo Branco, Portugal, 19–21 October 2022; Springer: Cham, Switzerland, 2022; pp. 523–532. [Google Scholar]
  32. Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1096–1104. [Google Scholar]
  33. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  34. Bertiche, H.; Madadi, M.; Escalera, S. Cloth3d: Clothed 3d humans. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 344–359. [Google Scholar]
  35. Rostamzadeh, N.; Hosseini, S.; Boquet, T.; Stokowiec, W.; Zhang, Y.; Jauvin, C.; Pal, C. Fashion-gen: The generative fashion dataset and challenge. arXiv 2018, arXiv:1806.08317. [Google Scholar] [CrossRef]
  36. Jia, M.; Shi, M.; Sirotenko, M.; Cui, Y.; Cardie, C.; Hariharan, B.; Adam, H.; Belongie, S. Fashionpedia: Ontology, segmentation, and an attribute localization dataset. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 316–332. [Google Scholar]
  37. Zheng, S.; Yang, F.; Kiapour, M.H.; Piramuthu, R. Modanet: A large-scale street fashion dataset with polygon annotations. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1670–1678. [Google Scholar]
  38. Ma, L.; Sun, Q.; Georgoulis, S.; Van Gool, L.; Schiele, B.; Fritz, M. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 99–108. [Google Scholar]
  39. Minar, M.R.; Tuan, T.T.; Ahn, H.; Rosin, P.; Lai, Y.K. Cp-vton+: Clothing shape and texture preserving image-based virtual try-on. In Proceedings of the CVPR Workshops, Seattle, WA, USA, 14–19 June 2020; Volume 3, pp. 10–14. [Google Scholar]
  40. Yang, H.; Zhang, R.; Guo, X.; Liu, W.; Zuo, W.; Luo, P. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7850–7859. [Google Scholar]
  41. Morelli, D.; Fincato, M.; Cornia, M.; Landi, F.; Cesari, F.; Cucchiara, R. Dress code: High-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; Volume 1. [Google Scholar]
  42. Wang, W.; Ho, H.I.; Guo, C.; Rong, B.; Grigorev, A.; Song, J.; Zarate, J.J.; Hilliges, O. 4d-dress: A 4d dataset of real-world human clothing with semantic annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 550–560. [Google Scholar]
  43. Wang, B.; Zheng, H.; Liang, X.; Chen, Y.; Lin, L.; Yang, M. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 589–604. [Google Scholar]
  44. Dong, H.; Liang, X.; Zhang, Y.; Zhang, X.; Shen, X.; Xie, Z.; Wu, B.; Yin, J. Fashion editing with adversarial parsing learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8120–8128. [Google Scholar]
  45. Yildirim, G.; Jetchev, N.; Vollgraf, R.; Bergmann, U. Generating high-resolution fashion model images wearing custom outfits. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  46. Cui, Y.R.; Liu, Q.; Gao, C.Y.; Su, Z. FashionGAN: Display your fashion design using conditional generative adversarial nets. Comput. Graph. Forum 2018, 37, 109–119. [Google Scholar] [CrossRef]
  47. Lewis, K.M.; Varadharajan, S.; Kemelmacher-Shlizerman, I. Tryongan: Body-aware try-on via layered interpolation. ACM Trans. Graph. (TOG) 2021, 40, 115. [Google Scholar] [CrossRef]
  48. Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 2085–2094. [Google Scholar]
  49. Fu, J.; Li, S.; Jiang, Y.; Lin, K.Y.; Qian, C.; Loy, C.C.; Wu, W.; Liu, Z. Stylegan-human: A data-centric odyssey of human generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–19. [Google Scholar]
  50. Gupta, K.; Damani, S.; Narahari, K.N. Using AI to Design Stone Jewelry. arXiv 2018, arXiv:1811.08759. [Google Scholar] [CrossRef]
  51. Li, Y.; Wen, H. Jewelry Art Modeling Design Method Based on Computer-Aided Technology. Adv. Multimed. 2022, 2022, 4388128. [Google Scholar] [CrossRef]
  52. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. (TOG) 2019, 38, 146. [Google Scholar] [CrossRef]
  53. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  54. Biasotti, S.; Cerri, A.; Bronstein, A.; Bronstein, M. Recent trends, applications, and perspectives in 3d shape similarity assessment. Comput. Graph. Forum 2016, 35, 87–119. [Google Scholar] [CrossRef]
  55. Shu, D.W.; Park, S.W.; Kwon, J. 3d point cloud generative adversarial network based on tree structured graph convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3859–3868. [Google Scholar]
  56. Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; Guibas, L. Learning representations and generative models for 3d point clouds. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 40–49. [Google Scholar]
  57. Nichol, A.; Jun, H.; Dhariwal, P.; Mishkin, P.; Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. arXiv 2022, arXiv:2212.08751. [Google Scholar] [CrossRef]
  58. Schröppel, P.; Wewer, C.; Lenssen, J.E.; Ilg, E.; Brox, T. Neural point cloud diffusion for disentangled 3d shape and appearance generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8785–8794. [Google Scholar]
  59. Feng, Q. Makeup Transfer Using Generative Adversarial Network. Ph.D. Thesis, Nanyang Technological University, Singapore, 2022. [Google Scholar]
  60. Bougourzi, F.; Dornaika, F.; Barrena, N.; Distante, C.; Taleb-Ahmed, A. CNN based facial aesthetics analysis through dynamic robust losses and ensemble regression. Appl. Intell. 2023, 53, 10825–10842. [Google Scholar] [CrossRef]
  61. Zhang, Y.; Yuan, Y.; Song, Y.; Liu, J. Stablemakeup: When real-world makeup transfer meets diffusion model. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, Vancouver, BC, Canada, 10–14 August 2025; pp. 1–9. [Google Scholar]
  62. Sun, Z.; Xiong, S.; Chen, Y.; Du, F.; Chen, W.; Wang, F.; Rong, Y. Shmt: Self-supervised hierarchical makeup transfer via latent diffusion models. Adv. Neural Inf. Process. Syst. 2024, 37, 16016–16042. [Google Scholar]
  63. He, F.; Li, H.; Ning, X.; Li, Q. BeautyDiffusion: Generative latent decomposition for makeup transfer via diffusion models. Inf. Fusion 2025, 123, 103241. [Google Scholar] [CrossRef]
  64. Zhu, J.; Liu, S.; Li, L.; Gong, Y.; Wang, H.; Cheng, B.; Ma, Y.; Wu, L.; Wu, X.; Leng, D.; et al. FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer. arXiv 2025, arXiv:2508.05069. [Google Scholar]
  65. Lebedeva, I.; Guo, Y.; Ying, F. MEBeauty: A multi-ethnic facial beauty dataset in-the-wild. Neural Comput. Appl. 2022, 34, 14169–14183. [Google Scholar] [CrossRef]
  66. Deb, D.; Zhang, J.; Jain, A.K. Advfaces: Adversarial face synthesis. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–10. [Google Scholar]
  67. Hu, S.; Liu, X.; Zhang, Y.; Li, M.; Zhang, L.Y.; Jin, H.; Wu, L. Protecting facial privacy: Generating adversarial identity masks via style-robust makeup transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 15014–15023. [Google Scholar]
  68. Dabouei, A.; Soleymani, S.; Dawson, J.; Nasrabadi, N. Fast geometrically-perturbed adversarial faces. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1979–1988. [Google Scholar]
  69. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  70. Elsheikh, R.A.; Mohamed, M.; Abou-Taleb, A.M.; Ata, M.M. Accuracy is not enough: A heterogeneous ensemble model versus FGSM attack. Complex Intell. Syst. 2024, 10, 8355–8382. [Google Scholar] [CrossRef]
  71. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  72. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  73. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  74. Veit, A.; Kovacs, B.; Bell, S.; McAuley, J.; Bala, K.; Belongie, S. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4642–4650. [Google Scholar]
  75. Han, X.; Wu, Z.; Jiang, Y.G.; Davis, L.S. Learning fashion compatibility with bidirectional lstms. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1078–1086. [Google Scholar]
  76. Vasileva, M.I.; Plummer, B.A.; Dusad, K.; Rajpal, S.; Kumar, R.; Forsyth, D. Learning type-aware embeddings for fashion compatibility. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 390–405. [Google Scholar]
  77. Yang, X.; Du, X.; Wang, M. Learning to match on graph for fashion compatibility modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 287–294. [Google Scholar]
  78. Cui, Z.; Li, Z.; Wu, S.; Zhang, X.Y.; Wang, L. Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 307–317. [Google Scholar]
  79. Nakamura, T.; Goto, R. Outfit generation and style extraction via bidirectional lstm and autoencoder. arXiv 2018, arXiv:1807.03133. [Google Scholar] [CrossRef]
  80. Li, Z.; Li, J.; Wang, T.; Gong, X.; Wei, Y.; Luo, P. Ocphn: Outfit compatibility prediction with hypergraph networks. Mathematics 2022, 10, 3913. [Google Scholar] [CrossRef]
  81. Li, X.; Wang, X.; He, X.; Chen, L.; Xiao, J.; Chua, T.S. Hierarchical fashion graph network for personalized outfit recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 159–168. [Google Scholar]
  82. Hsiao, W.L.; Katsman, I.; Wu, C.Y.; Parikh, D.; Grauman, K. Fashion++: Minimal edits for outfit improvement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5047–5056. [Google Scholar]
  83. Liu, L.; Zhang, H.; Zhou, D. Clothing generation by multi-modal embedding: A compatibility matrix-regularized GAN model. Image Vis. Comput. 2021, 107, 104097. [Google Scholar] [CrossRef]
  84. Shen, Y.; Huang, R.; Huang, W. GD-StarGAN: Multi-domain image-to-image translation in garment design. PLoS ONE 2020, 15, e0231719. [Google Scholar] [CrossRef]
  85. Zhou, D.; Zhang, H.; Yang, K.; Liu, L.; Yan, H.; Xu, X.; Zhang, Z.; Yan, S. Learning to synthesize compatible fashion items using semantic alignment and collocation classification: An outfit generation framework. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5226–5240. [Google Scholar] [CrossRef]
  86. Liu, L.; Zhang, H.; Ji, Y.; Wu, Q.J. Toward AI fashion design: An Attribute-GAN model for clothing match. Neurocomputing 2019, 341, 156–167. [Google Scholar] [CrossRef]
  87. Singh, K.K.; Ojha, U.; Lee, Y.J. Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6490–6499. [Google Scholar]
  88. Li, K.; Liu, C.; Forsyth, D. Coherent and controllable outfit generation. arXiv 2019, arXiv:1906.07273. [Google Scholar] [CrossRef]
  89. Lin, A.; Zhao, N.; Ning, S.; Qiu, Y.; Wang, B.; Han, X. Fashiontex: Controllable virtual try-on with text and texture. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–9. [Google Scholar]
  90. Liu, Y.; Tang, J.; Zheng, C.; Zhang, S.; Hao, J.; Zhu, J.; Huang, D. ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians. arXiv 2024, arXiv:2406.16815. [Google Scholar] [CrossRef]
  91. Westerlund, M. The emergence of deepfake technology: A review. Technol. Innov. Manag. Rev. 2019, 9, 40–53. [Google Scholar] [CrossRef]
  92. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  93. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  94. Islam, T.; Miron, A.; Liu, X.; Li, Y. StyleVTON: A multi-pose virtual try-on with identity and clothing detail preservation. Neurocomputing 2024, 594, 127887. [Google Scholar] [CrossRef]
  95. Chang, H.J.; Lee, H.Y.; Kim, M.J. Virtual reality in fashion: A systematic review and research agenda. Cloth. Text. Res. J. 2025. [Google Scholar] [CrossRef]
  96. Chong, Z.; Dong, X.; Li, H.; Zhang, S.; Zhang, W.; Zhang, X.; Zhao, H.; Jiang, D.; Liang, X. Catvton: Concatenation is all you need for virtual try-on with diffusion models. arXiv 2024, arXiv:2407.15886. [Google Scholar] [CrossRef]
  97. Prasetya, L.A.; Widiyawati, I.; Rofiudin, A.; Haq, S.T.N.; Hendranawan, R.S.; Permataningtyas, A.; Ichwanto, M.A. The use of CLO3D application in vocational school fashion expertise program: Innovations, challenges and recommendations. J. Res. Instr. 2025, 5, 287–299. [Google Scholar] [CrossRef]
  98. Watson, A.; Alexander, B.; Salavati, L. The impact of experiential augmented reality applications on fashion purchase intention. Int. J. Retail Distrib. Manag. 2020, 48, 433–451. [Google Scholar] [CrossRef]
  99. Häkkilä, J.; Colley, A.; Roinesalo, P.; Väyrynen, J. Clothing integrated augmented reality markers. In Proceedings of the 16th International Conference on Mobile and Ubiquitous Multimedia, Stuttgart, Germany, 26–29 November 2017; pp. 113–121. [Google Scholar]
  100. Dacko, S.G. Enabling smart retail settings via mobile augmented reality shopping apps. Technol. Forecast. Soc. Change 2017, 124, 243–256. [Google Scholar] [CrossRef]
  101. Javornik, A.; Rogers, Y.; Moutinho, A.M.; Freeman, R. Revealing the shopper experience of using a “magic mirror” augmented reality make-up application. In Proceedings of the Conference on Designing Interactive Systems, Brisbane, Australia, 4–8 June 2016; Association for Computing Machinery (ACM): New York, NY, USA, 2016; Volume 2016, pp. 871–882. [Google Scholar]
  102. Flavián, C.; Ibáñez-Sánchez, S.; Orús, C. The impact of virtual, augmented and mixed reality technologies on the customer experience. J. Bus. Res. 2019, 100, 547–560. [Google Scholar] [CrossRef]
  103. Juhlin, O.; Zhang, Y.; Wang, J.; Andersson, A. Fashionable services for wearables: Inventing and investigating a new design path for smart watches. In Proceedings of the 9th Nordic Conference on Human-Computer Interaction, Gothenburg, Sweden, 23–27 October 2016; pp. 1–10. [Google Scholar]
  104. Rauschnabel, P.A.; Babin, B.J.; tom Dieck, M.C.; Krey, N.; Jung, T. What is augmented reality marketing? Its definition, complexity, and future. J. Bus. Res. 2022, 142, 1140–1150. [Google Scholar] [CrossRef]
  105. Herz, M.; Rauschnabel, P.A. Understanding the diffusion of virtual reality glasses: The role of media, fashion and technology. Technol. Forecast. Soc. Change 2019, 138, 228–242. [Google Scholar] [CrossRef]
  106. Han, X.; Hu, X.; Huang, W.; Scott, M.R. Clothflow: A flow-based model for clothed person generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10471–10480. [Google Scholar]
  107. Ge, Y.; Song, Y.; Zhang, R.; Ge, C.; Liu, W.; Luo, P. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8485–8493. [Google Scholar]
  108. Huang, Z.; Fan, H.; Wang, L.; Sheng, L. From parts to whole: A unified reference framework for controllable human image generation. arXiv 2024, arXiv:2404.15267. [Google Scholar] [CrossRef]
  109. Honda, S. Viton-gan: Virtual try-on image generator trained with adversarial loss. arXiv 2019, arXiv:1911.07926. [Google Scholar]
  110. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
  111. Morelli, D.; Baldrati, A.; Cartella, G.; Cornia, M.; Bertini, M.; Cucchiara, R. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8580–8589. [Google Scholar]
  112. Fang, Z.; Zhai, W.; Su, A.; Song, H.; Zhu, K.; Wang, M.; Chen, Y.; Liu, Z.; Cao, Y.; Zha, Z.J. Vivid: Video virtual try-on using diffusion models. arXiv 2024, arXiv:2405.11794. [Google Scholar]
  113. He, Z.; Chen, P.; Wang, G.; Li, G.; Torr, P.H.; Lin, L. Wildvidfit: Video virtual try-on in the wild via image-based controlled diffusion models. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 123–139. [Google Scholar]
  114. Li, D.; Zhong, W.; Yu, W.; Pan, Y.; Zhang, D.; Yao, T.; Han, J.; Mei, T. Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 22648–22657. [Google Scholar]
  115. Nguyen, H.; Nguyen, Q.Q.V.; Nguyen, K.; Nguyen, R. Swifttry: Fast and consistent video virtual try-on with diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6200–6208. [Google Scholar]
  116. Li, S.; Jiang, Z.; Zhou, J.; Liu, Z.; Chi, X.; Wang, H. Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency. arXiv 2025, arXiv:2501.08682. [Google Scholar]
  117. Xu, Z.; Chen, M.; Wang, Z.; Xing, L.; Zhai, Z.; Sang, N.; Lan, J.; Xiao, S.; Gao, C. Tunnel try-on: Excavating spatial-temporal tunnels for high-quality virtual try-on in videos. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 3199–3208. [Google Scholar]
  118. Zheng, J.; Zhao, F.; Xu, Y.; Dong, X.; Liang, X. Viton-dit: Learning in-the-wild video try-on from human dance videos via diffusion transformers. arXiv 2024, arXiv:2405.18326. [Google Scholar]
  119. Jiang, J.; Wang, T.; Yan, H.; Liu, J. Clothformer: Taming video virtual try-on in all module. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10799–10808. [Google Scholar]
  120. Kocabas, M.; Athanasiou, N.; Black, M.J. Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5253–5263. [Google Scholar]
  121. Dong, H.; Liang, X.; Shen, X.; Wu, B.; Chen, B.C.; Yin, J. Fw-gan: Flow-navigated warping gan for video virtual try-on. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1161–1170. [Google Scholar]
  122. Zhong, X.; Wu, Z.; Tan, T.; Lin, G.; Wu, Q. Mv-ton: Memory-based video virtual try-on network. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 908–916. [Google Scholar]
  123. Wang, Y.; Dai, W.; Chan, L.; Zhou, H.; Zhang, A.; Liu, S. Gpd-vvto: Preserving garment details in video virtual try-on. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7133–7142. [Google Scholar]
  124. Zuo, T.; Huang, Z.; Ning, S.; Lin, E.; Liang, C.; Zheng, Z.; Jiang, J.; Zhang, Y.; Gao, M.; Dong, X. DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework. arXiv 2025, arXiv:2508.02807. [Google Scholar]
  125. Song, D.; Li, T.; Mao, Z.; Liu, A.A. SP-VITON: Shape-preserving image-based virtual try-on network. Multimed. Tools Appl. 2020, 79, 33757–33769. [Google Scholar] [CrossRef]
  126. Dong, H.; Liang, X.; Shen, X.; Wang, B.; Lai, H.; Zhu, J.; Hu, Z.; Yin, J. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9026–9035. [Google Scholar]
  127. De Luigi, L.; Li, R.; Guillard, B.; Salzmann, M.; Fua, P. Drapenet: Garment generation and self-supervised draping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1451–1460. [Google Scholar]
  128. Gundogdu, E.; Constantin, V.; Seifoddini, A.; Dang, M.; Salzmann, M.; Fua, P. Garnet: A two-stream network for fast and accurate 3d cloth draping. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8739–8748. [Google Scholar]
  129. Bertiche, H.; Madadi, M.; Escalera, S. Neural cloth simulation. ACM Trans. Graph. (TOG) 2022, 41, 220. [Google Scholar] [CrossRef]
  130. Pfaff, T.; Fortunato, M.; Sanchez-Gonzalez, A.; Battaglia, P. Learning mesh-based simulation with graph networks. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  131. Bhatnagar, B.L.; Tiwari, G.; Theobalt, C.; Pons-Moll, G. Multi-garment net: Learning to dress 3d people from images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5420–5430. [Google Scholar]
  132. Dong, J.; Fang, Q.; Huang, Z.; Xu, X.; Wang, J.; Peng, S.; Dai, B. Tela: Text to layer-wise 3d clothed human generation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 19–36. [Google Scholar]
  133. Müller, M.; Heidelberger, B.; Hennix, M.; Ratcliff, J. Position based dynamics. J. Vis. Commun. Image Represent. 2007, 18, 109–118. [Google Scholar] [CrossRef]
  134. Macklin, M.; Müller, M.; Chentanez, N. XPBD: Position-based simulation of compliant constrained dynamics. In Proceedings of the 9th International Conference on Motion in Games, Burlingame, CA, USA, 10–12 October 2016; pp. 49–54. [Google Scholar]
  135. Bender, J.; Müller, M.; Macklin, M. A survey on position based dynamics. In Proceedings of the European Association for Computer Graphics: Tutorials (EG ’17), Lyon, France, 24–28 April 2017; pp. 1–31. [Google Scholar]
  136. Stomakhin, A.; Schroeder, C.; Chai, L.; Teran, J.; Selle, A. A material point method for snow simulation. ACM Trans. Graph. (TOG) 2013, 32, 102. [Google Scholar] [CrossRef]
  137. Jiang, C.; Schroeder, C.; Teran, J.; Stomakhin, A.; Selle, A. The material point method for simulating continuum materials. In Proceedings of the ACM Siggraph 2016 Courses, Anaheim, CA, USA, 24–28 July 2016; pp. 1–52. [Google Scholar]
  138. Guo, Q.; Han, X.; Fu, C.; Gast, T.; Tamstorf, R.; Teran, J. A material point method for thin shells with frictional contact. ACM Trans. Graph. (TOG) 2018, 37, 147. [Google Scholar] [CrossRef]
  139. Hu, Y.; Fang, Y.; Ge, Z.; Qu, Z.; Zhu, Y.; Pradhana, A.; Jiang, C. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Trans. Graph. (TOG) 2018, 37, 150. [Google Scholar] [CrossRef]
  140. Lv, A.; Zhu, Y.; Xian, C. Efficient cloth simulation based on the material point method. Comput. Animat. Virtual Worlds 2022, 33, e2073. [Google Scholar] [CrossRef]
  141. Georgescu, S.; Chow, P.; Okuda, H. GPU acceleration for FEM-based structural analysis. Arch. Comput. Methods Eng. 2013, 20, 111–121. [Google Scholar] [CrossRef]
  142. He, C.; Wang, Z.; Meng, Z.; Yao, J.; Guo, S.; Wang, H. Automated Task Scheduling for Cloth and Deformable Body Simulations in Heterogeneous Computing Environments. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, Vancouver, BC, Canada, 10–14 August 2025; pp. 1–11. [Google Scholar]
  143. Marsden, J.E.; West, M. Discrete mechanics and variational integrators. Acta Numer. 2001, 10, 357–514. [Google Scholar] [CrossRef]
  144. Fang, C.; Zhu, S.; Pan, J. Enhanced material point method with affine projection stabilizer for efficient hyperelastic simulations. Vis. Comput. 2025, 41, 6547–6560. [Google Scholar] [CrossRef]
  145. Fei, Y.; Batty, C.; Grinspun, E.; Zheng, C. A multi-scale model for simulating liquid-fabric interactions. ACM Trans. Graph. (TOG) 2018, 37, 51. [Google Scholar] [CrossRef]
  146. Va, H.; Choi, M.H.; Hong, M. Real-time cloth simulation using compute shader in Unity3D for AR/VR contents. Appl. Sci. 2021, 11, 8255. [Google Scholar] [CrossRef]
  147. Kim, T.; Ma, J.; Hong, M. Real-Time Cloth Simulation in Extended Reality: Comparative Study Between Unity Cloth Model and Position-Based Dynamics Model with GPU. Appl. Sci. 2025, 15, 6611. [Google Scholar] [CrossRef]
  148. Su, T.; Zhang, Y.; Zhou, Y.; Yu, Y.; Du, S. GPU-based Real-time Cloth Simulation for Virtual Try-on. In Proceedings of the PG (Short Papers and Posters), Hong Kong, 8–11 October 2018; pp. 1–2. [Google Scholar]
  149. Li, C.; Tang, M.; Tong, R.; Cai, M.; Zhao, J.; Manocha, D. P-cloth: Interactive complex cloth simulation on multi-GPU systems using dynamic matrix assembly and pipelined implicit integrators. ACM Trans. Graph. (TOG) 2020, 39, 180. [Google Scholar] [CrossRef]
  150. Schmitt, N.; Knuth, M.; Bender, J.; Kuijper, A. Multilevel Cloth Simulation using GPU Surface Sampling. Virtual Real. Interact. Phys. Simul. 2013, 13, 1–10. [Google Scholar]
  151. Lan, L.; Lu, Z.; Long, J.; Yuan, C.; Li, X.; He, X.; Wang, H.; Jiang, C.; Yang, Y. Efficient GPU cloth simulation with non-distance barriers and subspace reuse. arXiv 2024, arXiv:2403.19272. [Google Scholar] [CrossRef]
  152. Sung, N.J.; Ma, J.; Kim, T.; Choi, Y.j.; Choi, M.H.; Hong, M. Real-Time Cloth Simulation Using WebGPU: Evaluating Limits of High-Resolution. arXiv 2025, arXiv:2507.11794. [Google Scholar]
  153. Yaakop, S.; Musa, N.; Idris, N.M. Digitalized Malay Traditional Neckline Stitches: Awareness and Appreciation of Malay Modern Dressmaker Community. In ASiDCON 2018 Proceeding Book; Universiti Teknologi MARA (UiTM): Shah Alam, Malaysia, 2018; p. 20. [Google Scholar]
  154. Karzhaubayev, K.; Wang, L.P.; Zhakebayev, D. DUGKS-GPU: An efficient parallel GPU code for 3D turbulent flow simulations using Discrete Unified Gas Kinetic Scheme. Comput. Phys. Commun. 2024, 301, 109216. [Google Scholar] [CrossRef]
  155. Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91. [Google Scholar]
  156. Arunkumar, M.; Gopinath, R.; Chandru, M.; Suguna, R.; Deepa, S.; Omprasath, V. Fashion Recommendation System for E-Commerce using Deep Learning Algorithms. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Mandi, India, 24–28 June 2024; pp. 1–7. [Google Scholar]
  157. Shirkhani, S.; Mokayed, H.; Saini, R.; Chai, H.Y. Study of AI-driven fashion recommender systems. SN Comput. Sci. 2023, 4, 514. [Google Scholar] [CrossRef]
  158. Kachbal, I.; El Abdellaoui, S.; Arhid, K. Revolutionizing fashion recommendations: A deep dive into deep learning-based recommender systems. In Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security, Meknes, Morocco, 18–19 April 2024; pp. 1–8. [Google Scholar]
  159. Kachbal, I.; Abdellaoui, S.E.; Arhid, K. Fashion Recommendation Systems: From Single Items to Complete Outfits. Int. J. Comput. Eng. Data Sci. (IJCEDS) 2025, 4, 27–40. [Google Scholar]
  160. Kachbal, I.; Errafi, I.; Harmali, M.E.; El Abdellaoui, S.; Arhid, K. YOLO-GARNet: A High-Quality Deep Learning System for Garment Analysis and Personalized Fashion Recommendation. Eng. Lett. 2025, 33, 4448. [Google Scholar]
  161. Landim, A.; Beltrão Moura, J.; de Barros Costa, E.; Vieira, T.; Wanick Vieira, V.; Bazaki, E.; Medeiros, G. Analysing the effectiveness of chatbots as recommendation systems in fashion online retail: A Brazil and United Kingdom cross-cultural comparison. J. Glob. Fash. Mark. 2025, 16, 295–321. [Google Scholar] [CrossRef]
  162. Grewe, L.; Reddy, J.U.; Dasuratha, V.; Rodriguez, J.; Ferreira, N. FashionBody and SmartFashion: Innovative components for a fashion recommendation system. In Proceedings of the Signal Processing, Sensor/Information Fusion, and Target Recognition XXXIV, Orlando, FL, USA, 14–16 April 2025; SPIE: Bellingham, WA, USA, 2025; Volume 13479, pp. 224–239. [Google Scholar]
  163. Chen, W.; Huang, P.; Xu, J.; Guo, X.; Guo, C.; Sun, F.; Li, C.; Pfadler, A.; Zhao, H.; Zhao, B. POG: Personalized outfit generation for fashion recommendation at Alibaba iFashion. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2662–2670. [Google Scholar]
  164. Suvarna, B.; Balakrishna, S. Enhanced content-based fashion recommendation system through deep ensemble classifier with transfer learning. Fash. Text. 2024, 11, 24. [Google Scholar] [CrossRef]
  165. Xu, Y.; Wang, W.; Feng, F.; Ma, Y.; Zhang, J.; He, X. Diffusion models for generative outfit recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1350–1359. [Google Scholar]
  166. Gulati, S. Fashion Recommendation: Outfit Compatibility using GNN. arXiv 2024, arXiv:2404.18040. [Google Scholar] [CrossRef]
  167. Hadi Kiapour, M.; Han, X.; Lazebnik, S.; Berg, A.C.; Berg, T.L. Where to buy it: Matching street clothing photos in online shops. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3343–3351. [Google Scholar]
  168. Ge, Y.; Zhang, R.; Wang, X.; Tang, X.; Luo, P. Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5337–5345. [Google Scholar]
  169. Jiang, Y.; Yang, S.; Qiu, H.; Wu, W.; Loy, C.C.; Liu, Z. Text2human: Text-driven controllable human image generation. ACM Trans. Graph. (TOG) 2022, 41, 162. [Google Scholar] [CrossRef]
  170. Feng, Y.; Lin, J.; Dwivedi, S.K.; Sun, Y.; Patel, P.; Black, M.J. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2093–2103. [Google Scholar]
  171. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  172. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  173. Li, P.; Noah, S.A.M.; Sarim, H.M. A survey on deep neural networks in collaborative filtering recommendation systems. arXiv 2024, arXiv:2412.01378. [Google Scholar] [CrossRef]
  174. Liu, C.; Zhang, Y.; Wang, J.; Ying, R.; Caverlee, J. Flow Matching for Collaborative Filtering. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, Toronto, ON, Canada, 3–7 August 2025; pp. 1765–1775. [Google Scholar]
  175. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  176. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
  177. He, X.; He, Z.; Song, J.; Liu, Z.; Jiang, Y.G.; Chua, T.S. NAIS: Neural attentive item similarity model for recommendation. IEEE Trans. Knowl. Data Eng. 2018, 30, 2354–2366. [Google Scholar] [CrossRef]
  178. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  179. Yaseen, M. What is YOLOv9: An in-depth exploration of the internal features of the next-generation object detector. arXiv 2024, arXiv:2409.07813. [Google Scholar]
  180. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar] [CrossRef]
  181. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  182. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  183. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1060–1069. [Google Scholar]
  184. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  185. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  186. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  187. Hsiao, W.L.; Grauman, K. Creating capsule wardrobes from fashion images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7161–7170. [Google Scholar]
  188. Gao, C.; Zheng, Y.; Li, N.; Li, Y.; Qin, Y.; Piao, J.; Quan, Y.; Chang, J.; Jin, D.; He, X.; et al. A survey of graph neural networks for recommender systems: Challenges, methods, and directions. ACM Trans. Recomm. Syst. 2023, 1, 3. [Google Scholar] [CrossRef]
  189. Wu, J.; He, X.; Wang, X.; Wang, Q.; Chen, W.; Lian, J.; Xie, X. Graph convolution machine for context-aware recommender system. Front. Comput. Sci. 2022, 16, 166614. [Google Scholar] [CrossRef]
  190. Mei, L.; Ren, P.; Chen, Z.; Nie, L.; Ma, J.; Nie, J.Y. An attentive interaction network for context-aware recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; pp. 157–166. [Google Scholar]
  191. Xin, X.; Chen, B.; He, X.; Wang, D.; Ding, Y.; Jose, J.M. CFM: Convolutional factorization machines for context-aware recommendation. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; Volume 19, pp. 3926–3932. [Google Scholar]
  192. Rashed, A.; Elsayed, S.; Schmidt-Thieme, L. Context and attribute-aware sequential recommendation via cross-attention. In Proceedings of the 16th ACM Conference on Recommender Systems, Seattle, WA, USA, 18–23 September 2022; pp. 71–80. [Google Scholar]
  193. Liu, Y.; Gao, Y.; Feng, S.; Li, Z. Weather-to-garment: Weather-oriented clothing recommendation. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 181–186. [Google Scholar]
  194. Chen, X.; Chen, H.; Xu, H.; Zhang, Y.; Cao, Y.; Qin, Z.; Zha, H. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 765–774. [Google Scholar]
  195. Celikik, M.; Wasilewski, J.; Mbarek, S.; Celayes, P.; Gagliardi, P.; Pham, D.; Karessli, N.; Ramallo, A.P. Reusable self-attention-based recommender system for fashion. In Proceedings of the Workshop on Recommender Systems in Fashion and Retail, Seattle, WA, USA, 18–23 September 2022; pp. 45–61. [Google Scholar]
  196. Li, W.; Xu, B. Aspect-based fashion recommendation with attention mechanism. IEEE Access 2020, 8, 141814–141823. [Google Scholar] [CrossRef]
  197. Ahmed, M.; Zhang, X.; Shen, Y.; Ali, N.; Flah, A.; Kanan, M.; Alsharef, M.; Ghoneim, S.S. A deep transfer learning based convolution neural network framework for air temperature classification using human clothing images. Sci. Rep. 2024, 14, 31658. [Google Scholar] [CrossRef]
  198. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 12. [Google Scholar] [CrossRef]
  199. Zhang, Y.; Chen, X. Explainable recommendation: A survey and new perspectives. Found. Trends® Inf. Retr. 2020, 14, 1–101. [Google Scholar] [CrossRef]
  200. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
Figure 1. Proposed research taxonomy for fashion computer vision: A unified framework categorizing techniques into generative modeling, physical simulation, and recommendation systems.
Figure 2. PRISMA flow diagram showing the systematic literature search and selection process for computer vision techniques in fashion applications; * Records identified from databases and registers. ** Records excluded during title and abstract screening.
Figure 3. Appearance mixture fashion image synthesis.
Figure 4. Performance evolution and comparative analysis of clothing synthesis methods. (a) FID and Inception Score trends (2018–2025) showing 54% FID improvement. (b) Parameter efficiency scatter plot with temporal color gradient. (c) Resolution impact on average FID scores across four categories. (d) Multi-dimensional performance comparison evaluating overall quality, realism, and efficiency metrics across five state-of-the-art methods.
Figure 5. Performance and architectural analysis of 3D jewelry synthesis methods. (a) FID score progression showing 48% improvement from TreeGAN (2019, FID = 43.2) to NPCD (2024, FID = 22.6) across four methods spanning 2019–2024. (b) Architectural comparison of model parameters (blue bars) and encoding dimensions (orange bars, ×10 scale). Point-E achieves best quality with largest architecture (40 M parameters, 1024-dimensional encoding), while NPCD demonstrates optimal efficiency-performance balance (35 M parameters, 768-dimensional encoding, FID = 22.6).
Figure 6. Taxonomy of outfit generation methods. Color coding distinguishes architectural paradigms: sequential modeling (yellow), embedding-based approaches (blue), and graph neural network models (green).
Figure 7. Outfit generation performance comparison across methodological approaches, including Cui et al. (NGNN) [78], Vasileva et al. (CSN) [76], and Han et al. (Bi-LSTM) [75].
Figure 8. Visual comparison of virtual try-on methods: Image-based approaches (a) and exploring realistic clothing deformation through advanced cloth simulation across diverse body shapes (b).
Figure 9. Performance comparison of 3D virtual try-on methods across four key metrics. (a) Body fit accuracy measured by SMPL alignment error and pose estimation error. (b) Surface distance quality via Chamfer Distance, showing 38% improvement from DrapeNet to Neural Cloth Sim. (c) Maximum deviation trends using Hausdorff Distance, demonstrating 47% reduction. (d) Wrinkle realism assessment through MSE, with Neural Cloth Sim achieving the lowest error (0.042). All metrics show consistent improvement toward more accurate and realistic garment simulation.
Figure 10. Examples of outfit matching task (a) and a comparison between product-based and scene-based recommendation (b).
Figure 11. Overview of research trends in fashion recommendation: (a) evolution of deep learning architectures (2015–2025), (b) comparative model performance across major fashion tasks, and (c) growth of multimodal datasets from image-based to text-integrated representations.
Figure 12. Performance comparison of recommendation methods. Lighter-colored bars denote traditional approaches (MF, SVD), while darker-colored bars represent deep learning–based methods (NCF, BiLSTM, GNN, Transformer).
Table 1. Overview of major publicly available datasets commonly used in clothing synthesis research, including dataset size, content type, and key characteristics.
Dataset | Year | Images | Category | Details
DeepFashion [32] | 2016 | 800 k+ | Multi-Category | Large-scale fashion dataset with rich annotations
Inshop [32] | 2016 | 52 k | Clothes | Consumer-to-shop retrieval benchmark
Fashion-MNIST [33] | 2017 | 70 k | Clothes | Grayscale clothing item images
Fashion-Gen [35] | 2018 | 293 k | Clothes | Fashion images with text descriptions
ModaNet [37] | 2018 | 55 k | Clothes | Street fashion with polygon annotations
CLOTH3D [34] | 2020 | 100 k | 3D Clothes | 3D garment meshes with simulation data
Fashionpedia [36] | 2020 | 48 k | Clothes | Fine-grained fashion attribute dataset
VITON-HD [3] | 2021 | 13 k | Clothes | High-resolution virtual try-on dataset
DressCode [41] | 2022 | 53 k | Multi-Category | Multi-category virtual try-on
4D-DRESS [42] | 2024 | 78 k | 4D Real Clothing | Real-world 4D textured scans
Table 2. State-of-the-art methods for makeup analysis and synthesis applications.
Method | Task | Dataset | Performance Metric
Chen et al. [59] | Makeup transfer | MIT (>100 images) | Visual Quality
CNN Ensemble [60] | Face beauty prediction | SCUT-FBP5500 | 91.2% (MAE)
Stable-Makeup [61] | Diffusion transfer | Real-world makeup | Transfer Quality
SHMT [62] | Self-supervised transfer | Hierarchical decomposition | Superior Fidelity
Table 3. Obfuscation Attack Success Rate (%) and Structural Similarity for Various Face Recognition Models.
Model | AdvFaces [66] | GFLM [68] | PGD [69] | FGSM [70]
FaceNet [71] | 99.67 | 23.34 | 99.70 | 99.96
SphereFace [72] | 97.22 | 29.49 | 99.34 | 98.71
ArcFace [73] | 64.53 | 3.43 | 33.25 | 35.30
COTS-A | 82.98 | 8.89 | 18.74 | 32.48
COTS-B | 60.71 | 5.05 | 1.49 | 18.75
Structural Similarity | 0.95 ± 0.01 | 0.82 ± 0.12 | 0.29 ± 0.06 | 0.25 ± 0.06
Computation Time (s) | 0.01 | 3.22 | 11.74 | 0.03
Table 4. Methodological landscape of computational outfit generation: Technical approaches and innovation patterns.

| Method | Model | Approach | Unique Aspects |
|---|---|---|---|
| SeqGAN [75] | Bi-LSTM | Sequential | Visual-semantic embedding |
| StyleNet [79] | Bi-LSTM + AE | Sequential + Style | Style extraction via autoencoder |
| TypeAware [76] | CSN | Type-aware | Category-specific embeddings |
| NGNN [78] | NGNN | Graph-based | Node-wise graph networks |
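The sequential approaches in Table 4 treat an outfit as an ordered sequence of item embeddings scored by a bidirectional LSTM. The sketch below, which assumes precomputed per-item visual features, illustrates this general pattern in PyTorch; the layer sizes and the final compatibility head are illustrative and do not reproduce the published architectures.

```python
# Sketch of a Bi-LSTM outfit-compatibility scorer in the spirit of the
# sequential methods in Table 4. Item features are assumed to be
# precomputed (e.g., by a CNN); dimensions are illustrative.
import torch
import torch.nn as nn

class OutfitCompatibilityScorer(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)   # forward + backward states

    def forward(self, item_feats):
        # item_feats: (batch, num_items, feat_dim), items in wearing order
        states, _ = self.lstm(item_feats)
        pooled = states.mean(dim=1)                # average over the outfit
        return torch.sigmoid(self.head(pooled))    # compatibility score in [0, 1]

scorer = OutfitCompatibilityScorer()
outfit = torch.randn(2, 4, 512)                    # 2 outfits, 4 items each
print(scorer(outfit).shape)                        # torch.Size([2, 1])
```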
Table 5. Performance Metrics in Fashion Style Transfer and Generation Papers.

| Paper | Year | Evaluation Metrics | Dataset | Key Contribution |
|---|---|---|---|---|
| [82] | 2019 | Human evaluation | Web photos | Minimal outfit edits |
| [86] | 2019 | Visual quality | Polyvore | Attribute-based generation |
| [84] | 2020 | Inception Score | Fashion dataset | Multi-domain translation |
| [83] | 2021 | SSIM, LPIPS | Polyvore | Compatibility-regularized GAN |
Table 6. Fashion Generation Models and Control Approaches.

| Paper | Year | Model | Control Mechanism | Control Type |
|---|---|---|---|---|
| [87] | 2019 | FineGAN | Hierarchical disentanglement | Localized |
| [88] | 2019 | Multimodal embedding | Text-image coherence | Theme-based |
| [89] | 2023 | FashionTex | Text and texture | Attribute-based |
| [90] | 2024 | DCGS | 3D Gaussian splatting | Full garment |
Table 7. Virtual Try-On Datasets and Methods Information.

| Method/Dataset | Task | Key Innovation |
|---|---|---|
| DeepFashion [32] | Clothes recognition | Rich annotations |
| VITON [2] | Virtual try-on | Person representation |
| CP-VTON [43] | Characteristic-preserving | Geometric matching |
| ClothFlow [106] | Flow-based try-on | Appearance flow |
Table 8. Evolution of virtual try-on methodologies: From conditional generation to flow-based appearance transfer and semantic-guided synthesis.

| Method | Year | Input | Architecture | Key Contributions |
|---|---|---|---|---|
| CAGAN [27] | 2017 | Cloth. + Person Images | Conditional GAN | Fashion article swapping |
| VITON [2] | 2018 | Cloth. + Parsing Maps | Coarse-to-fine | Clothing-agnostic representation |
| CP-VTON [43] | 2018 | Cloth. + Person Images | GMM + TOM | Characteristic preservation |
| ClothFlow [106] | 2019 | Cloth. + Person Images | Flow-based | Appearance flow estimation |
| ACGPN [40] | 2020 | Cloth. + Semantic Layout | Attention + GAN | Semantic-guided generation |
| VITON-GAN [109] | 2019 | Cloth. + Person Images | Adversarial Training | Occlusion handling |
Table 9. Temporal dynamics and consistency preservation in video-based fashion applications.

| Method | Year | Input | Type |
|---|---|---|---|
| FW-GAN [121] | 2019 | Video | Flow-Guided Warping + GAN |
| VIBE [120] | 2020 | Video | Temporal Motion Discriminator + 3D Estimation |
| ClothFormer [119] | 2022 | Video | Appearance-Flow Tracking + Vision Transformer |
| WildVidFit [113] | 2024 | Video | Diffusion Guidance + VideoMAE Consistency |
| DPIDM [114] | 2025 | Video | Dynamic Pose Interaction + Diffusion Models |
| RealVVT [116] | 2025 | Video | Spatio-Temporal Attention + Foundation Models |
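A recurring concern for the video-based methods in Table 9 is temporal flicker between consecutive generated frames. As a rough illustration only, a simple frame-to-frame consistency check can be computed directly on the output video; the NumPy sketch below uses mean absolute differences between successive frames and is merely a crude proxy for the flow-warped consistency metrics used in the literature.

```python
# Crude temporal-consistency proxy for generated try-on videos (Table 9):
# mean absolute difference between consecutive frames. Published
# evaluations usually warp frames with optical flow first; this sketch omits that.
import numpy as np

def frame_consistency(frames):
    """frames: (T, H, W, C) float array in [0, 1]; lower output = smoother video."""
    diffs = np.abs(frames[1:] - frames[:-1])       # (T-1, H, W, C)
    return float(diffs.mean())

video = np.random.rand(16, 256, 192, 3)            # dummy 16-frame clip
print(frame_consistency(video))
```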
Table 10. Quantitative performance assessment of virtual try-on methodologies across detection accuracy, perceptual quality, and image fidelity metrics. ↑ indicates that higher values correspond to better performance, whereas ↓ indicates that lower values correspond to better performance.

| Method | Accessory Detection | Pose Estimation | SSIM ↑ | MSE ↓ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| VITON [2] | 78% | 83% | 0.65 | 0.021 | 0.21 | 45.2 |
| CP-VTON [43] | 86% | 89% | 0.69 | 0.017 | 0.18 | 39.1 |
| SP-VITON [125] | 80% | 85% | 0.72 | 0.015 | 0.15 | 32.5 |
| MG-VTON [126] | 82% | 87% | 0.75 | 0.012 | 0.12 | 28.1 |
| SwiftTry [115] | 88% | 91% | 0.78 | 0.009 | 0.10 | 24.9 |
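The reference-based fidelity metrics in Table 10 (SSIM, MSE) can be reproduced with standard libraries when paired ground-truth images are available; LPIPS and FID additionally require pretrained feature extractors (commonly via the lpips and torchmetrics packages) and are omitted here. The snippet below is a minimal sketch using scikit-image and imageio; the file names are placeholders.

```python
# Minimal sketch: SSIM and MSE for paired try-on results (Table 10).
# Assumes scikit-image and imageio; file names are placeholders.
# LPIPS and FID need learned feature extractors and are not shown.
import numpy as np
import imageio.v3 as iio
from skimage.metrics import structural_similarity, mean_squared_error

result = iio.imread("tryon_result.png").astype(np.float32) / 255.0
target = iio.imread("ground_truth.png").astype(np.float32) / 255.0

ssim = structural_similarity(result, target, channel_axis=-1, data_range=1.0)
mse = mean_squared_error(result, target)
print(f"SSIM={ssim:.3f}  MSE={mse:.4f}")
```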
Table 13. Comparison of Cloth Simulation Methods in VR/AR.

| Method | Performance (FPS) | Mesh Resolution | Platform |
|---|---|---|---|
| Mass-Spring [146] | 60+ | 50 × 50 vertices | Mobile VR |
| Unity Cloth [147] | 45–60 | 32 × 32 vertices | Meta Quest 3 |
| GPU-PBD [147] | 60+ | 64 × 64 vertices | Meta Quest 3 |
| Parallel GPU [154] | 60+ | 7K triangles | Desktop VR |
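The mass-spring entries in Table 13 rest on a simple update rule: each vertex carries a mass, neighboring vertices are connected by springs, and positions are advanced by explicit integration each frame. The NumPy sketch below shows one such integration loop for a small grid with structural springs only; the stiffness, damping, and time-step values are illustrative and far simpler than the GPU-optimized solvers cited above.

```python
# Minimal mass-spring cloth step (Table 13): structural springs on a grid,
# gravity, damping, and explicit Euler integration. Parameters are
# illustrative; production VR solvers use GPU-parallel PBD or similar.
import numpy as np

N = 20                                   # N x N vertex grid
rest = 1.0 / (N - 1)                     # rest length between neighbors
k, damping, dt = 500.0, 0.98, 1.0 / 60.0
gravity = np.array([0.0, -9.81, 0.0])

xs, ys = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N))
pos = np.stack([xs, ys, np.zeros_like(xs)], axis=-1)   # (N, N, 3) vertex positions
vel = np.zeros_like(pos)

def spring_force(p, q):
    """Hooke force on vertices p exerted by springs toward vertices q."""
    d = q - p
    length = np.linalg.norm(d, axis=-1, keepdims=True) + 1e-8
    return k * (length - rest) * d / length

def step():
    global pos, vel
    force = np.broadcast_to(gravity, pos.shape).copy()
    force[:, :-1] += spring_force(pos[:, :-1], pos[:, 1:])   # right neighbors
    force[:, 1:]  += spring_force(pos[:, 1:],  pos[:, :-1])  # left neighbors
    force[:-1, :] += spring_force(pos[:-1, :], pos[1:, :])   # neighbors above
    force[1:, :]  += spring_force(pos[1:, :],  pos[:-1, :])  # neighbors below
    vel = damping * (vel + dt * force)       # unit mass per vertex
    vel[-1, :] = 0.0                         # pin one edge of the cloth
    pos = pos + dt * vel

for _ in range(120):                         # simulate two seconds at 60 FPS
    step()
```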
Table 14. Single-Item Recommenders Datasets and Their Characteristics.

| Dataset | Year | Size | Task | Accuracy/Performance |
|---|---|---|---|---|
| Street2Shop [167] | 2015 | 404k shop, 20k street | Retrieval | MAP > 70% |
| DeepFashion [32] | 2016 | 800k images | Classification | 85% Top-1 |
| Fashion-Gen [35] | 2018 | 293k images | Generation | FID scores |
| ModaNet [37] | 2018 | 55k images | Segmentation | mIoU varies by class |
| DeepFashion2 [168] | 2019 | 491k images | Multi-task | Detection mAP > 60% |
| DeepFashion-MultiModal [169] | 2022 | 44k images | Multi-modal | Various tasks |
Table 15. Neural Network Approaches in Fashion Recommendation Systems.

| Method | Year | Dataset | Key Features | Performance |
|---|---|---|---|---|
| DeepFashion [32] | 2016 | DeepFashion | CNN-based recognition | Various tasks |
| NCF [176] | 2017 | MovieLens | Neural collaborative filtering | HR@10: 0.409 |
| Fashion Compatibility [75] | 2017 | Polyvore | Bidirectional LSTM | AUC: 0.9 |
| NAIS [177] | 2018 | Amazon | Attentive item similarity | HR@10: 0.686 |
| FashionNet [6] | 2018 | Polyvore | CNN + MLP | Compatibility scores |
| POG [163] | 2019 | Alibaba | Personalized outfit generation | Industrial deployment |
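Neural collaborative filtering (NCF), listed in Table 15, replaces the dot product of matrix factorization with a learned interaction function over user and item embeddings. The PyTorch sketch below illustrates the MLP variant of this idea; the embedding sizes and hidden layers are illustrative and are not tied to the HR@10 figures reported above.

```python
# Sketch of an NCF-style recommender (Table 15): user and item embeddings
# fed through an MLP that predicts an interaction score.
# Dimensions and layer sizes are illustrative placeholders.
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, num_users, num_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # interaction probability

model = NCF(num_users=1000, num_items=5000)
users = torch.tensor([3, 7])
items = torch.tensor([42, 101])
print(model(users, items))   # two scores in (0, 1)
```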
Table 16. Object Detection Methods in Computer Vision.

| Method | Description |
|---|---|
| Faster R-CNN [171] | Two-stage detector using a Region Proposal Network and bounding box/class prediction. |
| Mask R-CNN [172] | Extends Faster R-CNN with a mask prediction branch for instance segmentation. |
| YOLO [178] | Single-shot detector that predicts boxes and classes per grid cell. |
| YOLOv3 [179] | Improved YOLO with Darknet-53 backbone and multi-scale predictions. |
| YOLOv8 [180] | Improved architecture, anchor-free design, and enhanced training strategies. |
| YOLOv9 [181] | Introduces PGI and GELAN architectures. |
| YOLOv11 [182] | Improved accuracy-speed tradeoff and advanced feature extraction capabilities. |
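The single-shot detectors in Table 16 are commonly applied to fashion imagery for garment and accessory localization. The following is a minimal sketch using the Ultralytics YOLO interface on a generic pretrained checkpoint; the weights file, image path, and confidence threshold are placeholders, and a fashion-specific detector would first require fine-tuning on a dataset such as ModaNet or DeepFashion2.

```python
# Minimal sketch: running a YOLO detector on a fashion photo (Table 16).
# Assumes the `ultralytics` package; "yolov8n.pt" is a generic COCO
# checkpoint used as a placeholder, not a fashion-tuned model.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # load pretrained weights
results = model("street_outfit.jpg", conf=0.25) # run inference on one image

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]        # predicted class label
    confidence = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()       # bounding box corners
    print(f"{cls_name}: {confidence:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```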
Table 17. Climate-Adaptive and Weather-Aware Fashion Recommendation Methods.

| Method | Dataset | Key Evaluation Metric |
|---|---|---|
| WoG [193] | Clothing with weather data | Weather suitability |
| Multimodal Attention [194] | Fashion images with reviews | Visual explanation accuracy |
| AFRA [195] | Multi-entity fashion data | Recommendation precision |
| AFRAM [196] | E-commerce reviews/ratings | Rating prediction accuracy |
Table 18. Open-Source Implementations and Pre-trained Models.

| Method | Year | Task | Code Repository | Model |
|---|---|---|---|---|
| Generative Fashion | | | | |
| VITON [2] | 2018 | Virtual Try-On | https://github.com/xthan/VITON (accessed on 10 December 2025) | ✓ |
| CP-VTON [43] | 2018 | Characteristic-Preserving | https://github.com/sergeywong/cp-vton (accessed on 10 December 2025) | ✓ |
| ACGPN [40] | 2020 | Pose-Guided Try-On | https://github.com/switchablenorms/DeepFashion_Try_On (accessed on 10 December 2025) | ✓ |
| VITON-HD [3] | 2021 | High-Resolution Try-On | https://github.com/shadow2496/VITON-HD (accessed on 10 December 2025) | ✓ |
| TryOnGAN [47] | 2021 | Layered Interpolation | https://github.com/ofnote/TryOnGAN (accessed on 10 December 2025) | ✓ |
| StyleGAN-Human [49] | 2022 | Human Generation | https://github.com/stylegan-human/StyleGAN-Human (accessed on 10 December 2025) | ✓ |
| Simulative Fashion | | | | |
| VIBE [120] | 2020 | 3D Motion Estimation | https://github.com/mkocabas/VIBE (accessed on 10 December 2025) | ✓ |
| ClothFormer [119] | 2022 | Video Try-On | Code available on request | × |
| Neural Cloth Sim [129] | 2022 | Physics Simulation | https://github.com/hbertiche/NeuralClothSim (accessed on 10 December 2025) | ✓ |
| DrapeNet [127] | 2023 | Garment Draping | https://github.com/liren2515/DrapeNet (accessed on 10 December 2025) | ✓ |
| VIVID [112] | 2024 | Video Virtual Try-On | https://github.com/sharkdp/vivid (accessed on 10 December 2025) | × |
| VITON-DiT [118] | 2024 | Diffusion Transformers | https://github.com/ZhengJun-AI/viton-dit-page (accessed on 10 December 2025) | × |
| SwiftTry [115] | 2025 | Fast Video Try-On | https://github.com/VinAIResearch/SwiftTry (accessed on 10 December 2025) | × |
| RealVVT [116] | 2025 | Photorealistic Video | Code available on request | × |
| DPIDM [114] | 2025 | Dynamic Pose Interaction | Code available on request | × |
| Recommender Fashion | | | | |
| Fashion Compatibility [75] | 2017 | BiLSTM Sequential | https://github.com/xthan/polyvore (accessed on 10 December 2025) | × |
| NGNN [78] | 2019 | Graph-Based Outfit | https://github.com/CRIPAC-DIG/NGNN (accessed on 10 December 2025) | × |
| POG [163] | 2019 | Outfit Generation | Proprietary (Alibaba iFashion) | × |

Note: ✓ indicates that a pre-trained model is publicly available; × indicates that no pre-trained model is publicly available.