AI Clothing Pattern Generation: Combining Improved Pix2Pix Image Generation and Diffusion Model Repairing

Zheng, Xiaohu; Li, Xiechen; Liu, Bing; Xu, Bingshun

doi:10.3390/electronics15081751

by Xiaohu Zheng ¹,*, Xiechen Li ², Bing Liu ³ and Bingshun Xu ³

¹ Institute of Artificial Intelligence, Shanghai Engineering Center of Industrial Big Data and Intelligent System, Engineering Research Center of Artificial Intelligence, Donghua University, Shanghai 201620, China

² College of Information Science and Technology, Donghua University, Shanghai 201620, China

³ Hangzhou Zhongfu Technology & Innovation Research Institute Co., Ltd., Hangzhou 311199, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1751; https://doi.org/10.3390/electronics15081751

Submission received: 9 March 2026 / Revised: 6 April 2026 / Accepted: 7 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue 2D/3D Industrial Visual Inspection and Intelligent Image Processing)

Abstract

Clothing pattern-making is an important part of transforming design concepts into finished products; however, the traditional manual pattern-making process is time-consuming and inefficient, which seriously restricts the automation and precision of clothing production. This study proposes an automated clothing pattern-making method that combines an improved Pix2Pix model with a conditional diffusion model. The improved Pix2Pix model captures the complex structural information in clothing patterns by introducing a multi-scale discriminator and a new composite loss function. Because limited training data restricts the image quality that the improved Pix2Pix can reach on its own, a conditional diffusion model is introduced to enhance the detail and overall integrity of the generated images. Experiments were conducted on pattern-making tasks for the sleeves and back panels of various typical clothing styles. The sleeve components primarily validated the model’s basic generation capabilities: the initial template generated by the improved Pix2Pix captured the basic contour structure, and after diffusion model repair, the lines became clearer and the details more complete. The back panel components validated the model’s robustness. Quantitative results showed that the proposed method achieved SSIM, PSNR, and LPIPS values of 0.869, 22.31, and 0.1318, respectively. Compared with the results of other advanced models, the proposed method exhibits the highest accuracy and clarity in the generated images, confirming its practicality and effectiveness in automated apparel pattern-making.

Keywords:

clothing pattern generating; Pix2Pix; diffusion model; image quality repair

1. Introduction

In recent years, with the rapid development of artificial intelligence technologies such as reinforcement learning and deep learning, their application in the apparel industry has been increasingly explored. However, existing research has mostly focused on the apparel design process [1,2,3], while research on the application of these technologies to the critical process of pattern-making has been relatively scarce. Pattern-making is a critical link between design concepts and finished garment production. Its quality directly determines the final product’s shape, fit, and aesthetics. Figure 1a illustrates the traditional process from design sketches to pattern drafts to the final garment. Traditional pattern-making involves transforming a designer’s style requirements into two-dimensional flat patterns suitable for cutting and sewing. This is achieved through analyzing the dimensions, shapes, and structures of various parts of the garment. The creation of pattern drafts is a complex and meticulous process. A single garment often requires multiple patterns to accommodate different body types and fit requirements. Each pattern involves dozens of parameters, including dimensions, proportions, and the placement of structural lines. Designers must make precise adjustments to these parameters, which demands significant time and effort. In particular, the outline must be redrawn from scratch for each new pattern draft. According to industry reports, the pattern-making stage can account for 20–40% of the total product development time in fashion supply chains [4]. Moreover, studies indicate that manual pattern drafting for a single style may require 2–4 h of skilled labor, significantly limiting production agility [5].

Digital pattern-making technology provides a feasible solution to these problems, and current digital methods fall into three main categories. The first is parametric conversion, a design process based on algorithmic thinking that defines, encodes, and clarifies the relationship between the design intent and the design response through parameters and rules. Li et al. [6] modeled and optimized the key curves of knitted jumpers (e.g., the neckline curve, armhole curve, and sleeve cap curve) through a parametric design approach, and proposed a mathematical model constraint method based on circular arc fitting and Bézier curves, which significantly improves the efficiency and accuracy of jumper design. The second category is pattern library matching, in which preliminary patterns are generated by matching the design drawings with existing templates in a database, followed by manual adjustment and optimization. Liu et al. [7] proposed an improved apparel structural feature recognition and similar sample matching technique based on the AlexNet convolutional neural network, which realized intelligent matching from apparel style drawings to samples by identifying 18 fine-grained features of women’s pants, with an accuracy rate of 83.4% on the validation set. However, these methods still rely on manual intervention and struggle to meet the demand for full automation, especially when processing complex designs. The third category is 3D-based pattern-making technology, which extracts structural information from 3D garment point clouds or models to reconstruct 2D patterns. Korosteleva and Lee [8] proposed the NeuralTailor framework, which uses point-level attention mechanisms and recurrent neural networks to reconstruct sewing pattern structures from 3D garment point clouds, enabling generalization to garment types with unseen pattern topologies.

While NeuralTailor demonstrates promising generalization to unseen pattern topologies, the generated patterns are primarily limited to simplified contour representations. In practical apparel manufacturing, pattern drafts must incorporate intricate details such as seam allowances, notches, and internal grain lines, which remain challenging for current 3D-to-2D reconstruction approaches. In practical applications, only simple components such as t-shirt sleeves can be generated as references. Combining the 3D point cloud data and the 2D pattern generation process yields a simple component pattern generation process, as shown in Figure 1b.

At the same time, significant progress has been made in automated apparel design with the rapid development of generative models such as Generative Adversarial Networks (GANs) [9,10,11,12,13,14,15]. The Pix2Pix model [16], a specific GAN implementation, focuses on image translation, i.e., the conversion of an image from one form into another [17]. Although Pix2Pix performs well in many image-to-image translation tasks, it faces notable limitations when applied to complex structures like garment patterns, which involve fine edges and long-range dependencies [18]. These limitations arise from three aspects. First, the U-Net generator struggles to capture the long-range spatial dependencies essential for global structural coherence. Second, the PatchGAN discriminator only evaluates local patches, lacking global structural guidance. Third, the combination of L1 and adversarial losses focuses on pixel-level reconstruction rather than structure-level semantics such as contour continuity and precise edge alignment. These factors motivate the need for architectural enhancements and more structured loss functions in complex pattern generation tasks. Diffusion Probabilistic Models (DPMs), on the other hand, have become an effective tool for restoring detail in generated images: they transform the input data through a multi-step process of adding noise and then denoising in order to generate samples from a new data distribution [19,20]. Compared with traditional GANs, DPMs exhibit higher stability in the generation process and are better able to retain the detail information of complex images [21]. However, in scenarios with limited datasets, they are prone to issues such as unstable generation and missing details, making it difficult to fully leverage their performance.

Although existing digital pattern-making methods have made progress in parametric design, template matching, and 3D-based data reconstruction, they still face three key challenges when applied to automated clothing pattern generation.

First, the complex structural information in clothing patterns (such as cutting lines, darts, and contour curves) is difficult to capture using standard image translation models like Pix2Pix, as such models often struggle with structural coherence and fine edge details.

Second, single-stage generative models are often constrained by small-scale datasets, which are common in specialized clothing manufacturing scenarios. Although diffusion models excel in detail restoration, they tend to suffer from training instability and detail loss when data is insufficient.

Third, existing methods rarely balance global structural consistency with local detail realism during pattern generation, especially in the absence of sufficient supervision.

To address these challenges, this paper proposes a hybrid framework that combines an improved Pix2Pix model with a conditional diffusion model. Specifically, we enhance the Pix2Pix architecture by introducing a multi-scale discriminator and a composite structure-aware loss function to improve structural capture capability for complex clothing patterns. We design a two-stage generation-and-refinement pipeline, in which the improved Pix2Pix first establishes rough structural correspondences under limited data conditions, and the conditional diffusion model subsequently refines the details to overcome the instability of diffusion models in small-data settings. We validate the framework on components of varying complexity (sleeves and back panels), demonstrating its effectiveness in both basic structure generation and sensitivity to details, which shows the model’s adaptability to components of different complexity levels. The evaluation uses metrics such as Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS) to assess the structural similarity and detail precision.

The remainder of this paper is organized as follows. Section 2 reviews related work on GAN-based image generation and diffusion models. Section 3 presents the proposed methodology, including the improved Pix2Pix model and the conditional diffusion model for detail refinement. Section 4 describes the experimental setup, dataset, and evaluation metrics, followed by quantitative and qualitative results on both sleeve and back panel generation tasks, along with ablation studies and comparisons with state-of-the-art methods. Section 5 concludes the paper with a summary of contributions, limitations, and future work.

2. Related Work

2.1. GAN Image Generation

A GAN consists of a generator and a discriminator: the generator learns to produce realistic, high-quality images, while the discriminator learns to distinguish generated images from real ones. Training the two networks against each other improves the generator’s output while pushing the discriminator toward the limit of its discriminative ability, which has made GANs widely used in image generation, style transfer, and other fields. Since Goodfellow et al. [22] proposed the GAN in 2014, many derived GAN models have been proposed, with innovations in model structure [23,24], theoretical extensions [25,26], and application explorations [27].

In the field of fashion design, some studies have utilized GANs to generate apparel design sketches or convert design sketches into finished images, providing designers with more convenient design tools. Such studies usually focus on the image-to-image conversion task. For example, Cui et al. [11] proposed an end-to-end virtual apparel design presentation method based on Conditional Generative Adversarial Networks, which can quickly and automatically generate virtual apparel images with consistent shapes and textures from fashion sketches and fabric images. Liu et al. [10] constructed an Attribute-GAN model to automatically generate clothing matching image pairs based on the semantic attributes of clothing, which not only enhances the visual consistency of clothing matching, but also provides a new method for the automation of clothing design.

Although methods based on GAN, such as FashionGAN and Attribute-GAN, have achieved excellent results in the field of clothing design visualization, their applicability in pattern-making applications is fundamentally constrained by three core factors. First, the objective functions of these models only optimize for perceptual similarity (such as LPIPS, FID metrics) and visual realism, and have no direct correlation with structural line accuracy. If a pattern’s FID metric performs well, but the stitching offset is 2 mm, it cannot be used for actual production. Second, these models process RGB color images containing rich texture information; while pattern-making requires binary line drafts, where each pixel has structural significance, this data difference requires a completely different network structure design approach. Third, existing GAN models in the fashion field usually rely on large-scale datasets for training (such as the DeepFashion dataset containing over 200,000 images), while the professional annotated data in the pattern-making field is scarce and cannot reach the same data scale.

Pix2Pix is an image-to-image translation model based on conditional GANs (cGANs), which is trained on input–output pairs with explicit correspondences. Inspired by the success of Pix2Pix in image-to-image translation, we apply Pix2Pix to apparel pattern-making and use its image-to-image conversion capability to directly convert the designer’s sketches into the corresponding pattern drafts. This paper’s framework addresses the three constraints described above by designing task-specific loss functions, constructing multi-scale discriminators, and adopting a data-efficient two-stage training strategy.

2.2. Quality Repairing for Diffusion Model

The denoising diffusion probabilistic model (DDPM) was proposed by Ho et al. [19]. Its core idea is to convert the data distribution into a Gaussian distribution through a forward process that adds noise step by step, and then to recover the original image by learning a reverse process that gradually removes the noise, i.e., a “diffusion–reverse diffusion” mechanism. However, in its original formulation, the model exhibits limited flexibility in incorporating conditional inputs (such as design sketches) and suffers from poor adaptability in small-data scenarios, where training becomes unstable and fine structural details are often lost. To overcome these limitations, Song et al. [28] proposed a score-based generative modeling method based on stochastic differential equations (SDEs) and introduced a continuous-time process, which laid the foundation for conditional diffusion models. Since then, diffusion models have shown excellent performance in image denoising, deblurring, super-resolution, and image restoration. Saharia et al. [29] proposed the SR3 model, which maps images from low to high resolution by gradually reducing the noise in the image and generates high-quality images with rich details. The SR3 model demonstrates the ability of diffusion models to recover fine structures from blurred low-resolution images and significantly improves the visual quality of the images. Lugmayr et al. [30] proposed the RePaint model, which utilizes the denoising capability of DDPM to reconstruct the complete image by gradually recovering details in damaged or missing image regions.

The above studies show that diffusion models can be applied not only to standard image restoration tasks such as super-resolution and denoising, but also to complex image occlusion problems. In both cases, the denoising process gradually reconstructs corrupted or blurred image details, which enhances the overall visual quality of the image and has made diffusion models an important tool for image restoration. Since the images generated by the Pix2Pix model are deficient in detail, this study combines Pix2Pix with a diffusion model to exploit the diffusion model’s strengths in image restoration and compensate for the missing image details.

2.3. GAN and Diffusion Model Fusion Paradigms

Recent studies have explored the synergistic integration of GANs and diffusion models to leverage the strengths of both paradigms. The prevailing fusion strategies can be categorized into two types: sequential coarse-to-fine generation and parallel latent space disentanglement. In the sequential paradigm, GANs generate initial coarse outputs that subsequent diffusion models refine—this architecture has been applied to tasks such as image super-resolution and face restoration. In the parallel paradigm, exemplified by FS-Control and CAT-DM, GAN discriminators guide diffusion sampling or accelerate inference through implicit distribution initialization. ColorwAI further demonstrates GAN–diffusion disentanglement for controlled textile color variation.

However, existing GAN–diffusion fusion methods exhibit two critical limitations when applied to clothing pattern generation: (1) Line structure degradation—most methods prioritize photorealistic texture generation over precise contour preservation, leading to blurred or broken structural lines in pattern outputs; (2) Data inefficiency—these methods typically require large-scale paired datasets (often >10,000 samples) to achieve stable training, which is impractical in pattern-making scenarios where professional annotations are scarce.

In contrast, our framework addresses these gaps through three task-specific innovations: (1) a composite structure-aware loss combining Sobel edge loss and Block IoU loss to enforce line continuity; (2) a multi-scale discriminator architecture that evaluates both global pattern coherence and local edge sharpness; (3) a two-stage training strategy that uses improved Pix2Pix outputs as augmented training samples for the diffusion model, mitigating data scarcity. To our knowledge, this is the first GAN–diffusion fusion framework specifically designed for garment pattern generation that explicitly optimizes for structural line fidelity under limited supervision.

2.4. Discussion of Recent Garment Generation Models

Recently, several studies have made progress in generating clothing patterns. For instance, SewingLDM [31] uses a latent diffusion model to generate complex sewing patterns, while GarmentDiffusion [32] employs a diffusion transformer to generate three-dimensional clothing sewing patterns. AIpparel [33] is a multimodal foundation model trained on over 120,000 clothing samples. Although these methods are related to ours, a direct quantitative comparison within our experimental setup is limited by three factors: (1) task focus: these models mainly address 3D pattern restoration or text-conditioned generation, while our work focuses on generating 2D pattern drafts from design sketches under limited data conditions; (2) data scale: these methods require large-scale datasets, which are impractical in professional drafting scenarios; (3) reproducibility: the code and pretrained models of these works were not publicly available for use in this study. Therefore, our quantitative comparison focuses on representative models that are directly applicable to our task, which provides a fair assessment.

3. Method

3.1. Improved Pix2Pix Model

This study proposes an improved Pix2Pix model that learns a mapping from a conditional image x and a random noise vector z to an output image y. Here the condition x is a garment design sketch and y is the corresponding pattern draft, and the mapping is learned from paired design sketches and pattern drafts. The generator G and discriminator network D perform feedforward inference based on the design sketches. The model generates initial pattern drafts for garment sleeves or back panels. The overall architecture of the improved model is shown in Figure 2. The generator takes clothing design sketches as input to produce initial patterns, while the discriminators evaluate the generated images against ground truth. Ground truth patterns are paired with design sketches in the dataset, and during training, the discriminators compare generated images with ground truth to enforce structural consistency. The final discrimination integrates local (D1) and global (D2) evaluations to guide generator optimization.

As shown in Figure 3, the generator of the proposed model adopts a U-Net [34]-based architecture, which gradually converts the input clothing design image into a clothing pattern through an encoder–decoder structure. The encoder gradually extracts high-level features through a series of convolutional down-sampling operations, while the decoder gradually recovers the details of the image through transposed-convolution up-sampling. Skip connections are used to transfer detail information between the encoder and decoder, ensuring that the resulting clothing patterns maintain clear edges and structures. In addition, a self-attention mechanism is added to the last layer of the generator encoder. This position is chosen because the encoder’s last layer corresponds to the deepest feature map, which has the smallest spatial resolution but the largest receptive field. At this stage, the feature maps already encode high-level semantic information while preserving the global structural layout of the clothing pattern. Adding self-attention at this layer allows the model to capture long-range dependencies, such as the spatial relationship between the sleeve cap and cuff or between the back center line and side seams, without introducing excessive computational overhead. In contrast, inserting self-attention at earlier layers would involve higher-resolution feature maps, significantly increasing computational cost while providing limited benefits for global structural coherence. Therefore, this design strikes a favorable trade-off between global modeling capability and computational efficiency, ensuring that the generated patterns maintain both structural integrity and fine detail.
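The cost argument for placing self-attention at the bottleneck can be illustrated with a minimal NumPy sketch. It assumes a single attention head, randomly initialised projection matrices, a plain residual connection, and hypothetical feature-map sizes (4 × 4 at the bottleneck, 64 × 64 at an early layer); none of these details are specified in the text, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def self_attention_2d(feat, w_q, w_k, w_v):
    """Single-head self-attention over the spatial positions of a (C, H, W) map."""
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T             # (N, C): one token per position
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])        # (N, N) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
    out = attn @ v                                # every position mixes all others
    return feat + out.T.reshape(c, h, w), attn    # residual connection (assumed)

def attention_entries(h, w):
    """Number of entries in the N x N attention matrix for an H x W map."""
    return (h * w) ** 2

rng = np.random.default_rng(0)
channels = 8                                      # hypothetical channel count
w_q = rng.normal(scale=0.1, size=(channels, channels))
w_k = rng.normal(scale=0.1, size=(channels, channels))
w_v = rng.normal(scale=0.1, size=(channels, channels))

bottleneck = rng.normal(size=(channels, 4, 4))    # hypothetical deepest feature map
out, attn = self_attention_2d(bottleneck, w_q, w_k, w_v)

# Attention-matrix size at an early 64 x 64 layer relative to the 4 x 4 bottleneck.
cost_ratio = attention_entries(64, 64) // attention_entries(4, 4)
```

Because the attention matrix grows with the square of the number of spatial positions, the same layer applied to a 64 × 64 map would need 65,536 times as many attention entries as at the 4 × 4 bottleneck in this example.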

The proposed model draws on the idea of Pix2PixHD [35] and introduces a multi-scale discriminator, in which local and global feature extraction and evaluation are carried out by discriminators D1 and D2, respectively. Discriminator D1 acts as a local discriminator based on the PatchGAN architecture and performs fine-grained discrimination on each small region of the input image to ensure that the generated clothing pattern is consistent with the real image in terms of texture, edges, and other details. The output of the PatchGAN is a smaller feature map, with each element giving a realism assessment for one local region of the image. Discriminator D2 acts as a global discriminator and evaluates the whole image at once, capturing the overall structure and semantic consistency of the image. D2 obtains the global information by gradual down-sampling and outputs a global realism score. Through this multi-scale discrimination mechanism, the combination of the local PatchGAN discriminator and the global discriminator enables a more comprehensive evaluation of the generated image’s quality. D1 ensures that the generated image is realistic in local details, while D2 ensures the global structural coherence of the image. This guides the generator to produce high-quality pattern drafts that are rich in detail and realistic overall.

Since the traditional Pix2Pix objective captures only pixel-level differences and cannot express the structure-level differences that matter in clothing pattern generation, we design a multi-scale structure-aware loss, given in Equation (1). The combination of five loss terms ensures that the generator accounts for both the global structure and local details of the image, and weighting coefficients adjust the relative importance of the different terms to achieve structural alignment of the image.

L_{EP} = L_{A} + λ_{2} L_{2} + λ_{VGG} L_{VGG} + λ_{Sobel} L_{Sobel} + λ_{IoU} L_{IoU}

(1)

In Equation (1), L_A denotes the adversarial loss, L_2 denotes the L2 loss, L_VGG denotes the perceptual loss, L_Sobel denotes the Sobel edge loss, L_IoU denotes the Block IoU loss, and each λ is the weight coefficient of the corresponding loss term.
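As a small illustration of how Equation (1) combines its terms, the sketch below computes the weighted sum from already-evaluated scalar losses. The weight values and the per-term loss values are placeholders chosen for the example; this section does not report the coefficients actually used.

```python
def composite_loss(terms, weights):
    """L_EP = L_A + sum_k lambda_k * L_k over the four non-adversarial terms."""
    total = terms["adv"]                 # the adversarial term has no weight in Eq. (1)
    for name, lam in weights.items():
        total += lam * terms[name]
    return total

# Placeholder weights: hypothetical values, not taken from the paper.
weights = {"l2": 10.0, "vgg": 1.0, "sobel": 5.0, "iou": 2.0}
# Placeholder per-term losses for one training batch.
terms = {"adv": 0.7, "l2": 0.02, "vgg": 0.3, "sobel": 0.05, "iou": 0.4}

loss = composite_loss(terms, weights)    # 0.7 + 0.2 + 0.3 + 0.25 + 0.8
```

With these placeholder numbers the total is 2.25; raising a single weight shifts the generator's optimisation toward the corresponding structural property.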

L2 loss is used to measure the pixel-level difference between the generated image and the real image, encouraging the generator to produce results whose overall structure is consistent with the real image. The L2 loss is combined with other loss functions to ensure the generated image’s basic shape matches the target image.

L_{2} = ∥ y - G (x) ∥_{2}^{2}

(2)

where y is the real image and G(x) is the image generated by the generator.

The perceptual loss measures the difference between the generated image and the real image in the high-level feature space by extracting the intermediate feature layer of the VGG19 model. This loss captures the semantic information of the image and ensures that the generated clothing pattern is not only realistic at the pixel level, but also similar to the target image in terms of perceptual quality.

L_{VGG} = \sum_{i} ∥ ϕ_{i} (y) - ϕ_{i} (G (x)) ∥_{2}^{2}

(3)

where ϕ_i denotes the ith feature layer of the VGG19 model.

Sobel Edge Loss: Sobel edge loss measures the matching of edges by calculating the edge images of the generated image and the target image (using the Sobel operator). This loss ensures that the edges of the generated image match the edges of the real image, thus enhancing the detailed presentation of the generated image, which is especially critical for contour lines in clothing patterns.

L_{Sobel} = ∥ Sobel (y) - Sobel (G (x)) ∥_{2}^{2}

(4)

where Sobel denotes the image edge computed using the Sobel operator.
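A minimal NumPy sketch of Equation (4) is given below. It assumes single-channel images, the standard 3 × 3 Sobel kernels, the gradient magnitude as the edge image, and no padding at the border; the text does not say which of these choices the authors made, and a training implementation would use a differentiable framework instead.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2D image (no padding)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def sobel_magnitude(img):
    """Edge image: magnitude of the horizontal and vertical Sobel responses."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def sobel_loss(real, fake):
    """Squared L2 distance between the two edge images, as in Eq. (4)."""
    diff = sobel_magnitude(real) - sobel_magnitude(fake)
    return float(np.sum(diff ** 2))

# Toy pattern: one vertical contour line, and the same line shifted by one pixel.
real = np.zeros((8, 8))
real[:, 3] = 1.0
shifted = np.zeros((8, 8))
shifted[:, 4] = 1.0

loss_same = sobel_loss(real, real)
loss_shifted = sobel_loss(real, shifted)
```

The loss is zero for identical images and positive when the contour is displaced by a single pixel, which is the kind of edge misalignment this term is meant to penalise.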

Block IoU Loss: The Block IoU loss divides the image into small blocks and calculates the intersection over union (IoU) for each block. This loss helps to improve the accuracy of image generation in localized regions and ensures the consistency and integrity of the generated image within each block. This is especially important for handling localized details in apparel pattern-making images.

L_{IoU} = 1 - \frac{1}{n} \sum_{j = 1}^{n} \frac{|B_{j} \cap T_{j}|}{|B_{j} \cup T_{j}|}

(5)

where B_j and T_j denote the pixel sets of the generated image and the real image in the jth block, respectively, and n is the number of blocks. The process is shown in Figure 4.
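Equation (5) can be sketched in NumPy as follows. The block size, the 0.5 threshold used to decide which pixels belong to a line, and the convention that a block empty in both images counts as a perfect match are all assumptions made for this example. The hard threshold also makes this version non-differentiable, so it illustrates the quantity being measured rather than a loss that could be used directly for training.

```python
import numpy as np

def block_iou_loss(fake, real, block=4, thresh=0.5):
    """1 minus the mean per-block IoU of thresholded line pixels, as in Eq. (5)."""
    fake_fg = fake > thresh                  # B: line pixels in the generated image
    real_fg = real > thresh                  # T: line pixels in the real image
    h, w = real_fg.shape
    ious = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            b = fake_fg[r:r + block, c:c + block]
            t = real_fg[r:r + block, c:c + block]
            union = np.logical_or(b, t).sum()
            inter = np.logical_and(b, t).sum()
            # Assumption: a block with no line pixels in either image scores 1.
            ious.append(1.0 if union == 0 else inter / union)
    return 1.0 - float(np.mean(ious))

# Toy pattern: a horizontal line across the image, and a copy missing its right half.
real = np.zeros((8, 8))
real[1, :] = 1.0
partial = np.zeros((8, 8))
partial[1, :4] = 1.0

loss_full = block_iou_loss(real, real)
loss_partial = block_iou_loss(partial, real)
```

In this toy case three of the four blocks match exactly and one block has IoU 0, so the loss for the broken line is 0.25, localising the error to the block where the line is missing.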

3.2. Diffusion Probabilistic Models

While diffusion models theoretically demonstrate superior detail restoration capabilities, their direct application to clothing pattern generation faces practical constraints. Training a conditional diffusion model requires paired data with pixel-level alignment (design sketches → pattern drafts), and it requires a large amount of such data. However, the limited data size in this experiment could not adequately cover the complex structural variations in the clothing patterns, which caused unstable gradient propagation during diffusion model training.

To address this data bottleneck, a two-stage generation–refinement framework is proposed. The improved Pix2Pix model first establishes coarse structural correspondences under limited supervision (Section 3.1), generating preliminary patterns with 87.6% structural integrity (measured by edge continuity metrics). These synthetic patterns, though containing localized artifacts, provide augmented training samples for the subsequent diffusion model. Crucially, the conditional diffusion process learns a mapping from noisy variants of synthetic patterns to their clean versions, effectively disentangling structural preservation (handled by Pix2Pix) and detail refinement (handled by DPMs). The experiments in this paper support this layered hybrid design: GAN-based coarse generation followed by diffusion-based refinement achieves synergistic performance gains in data-scarce scenarios.

In this work, we adopt a channel-wise concatenation strategy to inject conditional information, as it provides a straightforward and computationally efficient way to leverage the structural prior from the improved Pix2Pix model. Given the limited dataset size and the two-stage framework, this simple yet effective design achieves satisfactory results, while more sophisticated conditioning mechanisms are left for future exploration.

Figure 5 shows the specific process of repairing the details of the generated clothing patterns using the DPM. First, the random noise X_T is combined with the preliminary clothing pattern S generated by the Pix2Pix model to form the input; second, a new image X_(T−1) is generated by the diffusion model; finally, sampling continues step by step until the repaired clothing pattern x_0 is obtained, improving the detail rendering and overall quality of the image.

In the proposed two-stage framework, the initial pattern S generated by the improved Pix2Pix serves as the structural prior for the diffusion model. To ensure that the subsequent refinement process does not distort the global structure already established by Pix2Pix, we introduce a conditional channel-wise concatenation mechanism. Specifically, both S and the noisy image X_T are of the same spatial resolution (H × W), and are concatenated along the channel dimension to form a tensor I ∈ ℝ^{H×W×2}. No additional spatial transformation is applied, thereby preserving pixel-level alignment between the structural prior and the noisy input.
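The concatenation step can be shown in a few lines of NumPy. The 64 × 64 resolution and the random stand-in for the Pix2Pix output are hypothetical, and the channel order (noise first, prior second) is an arbitrary choice for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64                          # hypothetical resolution
S = rng.random((H, W))                 # stand-in for the Pix2Pix pattern S
x_T = rng.standard_normal((H, W))      # pure Gaussian noise at step T

# Stack along a new last axis so pixel (i, j) of S sits next to pixel (i, j) of x_T.
I = np.stack([x_T, S], axis=-1)        # shape (H, W, 2)
```

Because no resizing or warping is involved, each spatial location in I carries the noisy value and the structural prior for exactly the same pixel.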

During the reverse denoising process, the conditional information is incorporated into each denoising step by feeding I into the denoising U-Net. The network learns to predict the noise conditioned on both the current noisy image and the structural prior. To prevent structural degradation, we do not apply random masking or augmentation to the conditional input S. Instead, S remains fixed throughout the reverse process, acting as a geometric anchor that guides the diffusion model to refine only local details (e.g., edge sharpness, stitch continuity) without altering the global contour. This design ensures that the diffusion model’s generative flexibility is constrained by the structural integrity provided by Pix2Pix, achieving a synergistic balance between coarse structural accuracy and fine-grained detail enhancement. The formula for the whole inverse denoising process can be expanded as

p_{θ} (x_{t - 1} | x_{t}, S) = 𝒩 (x_{t - 1}; μ_{θ} (x_{t}, t, S), Σ_{θ} (x_{t}, t, S))

(6)

where μ_θ and Σ_θ are the mean and variance predicted by the neural network based on the current image state x_t, the time step t, and the condition information S, and 𝒩 denotes a normal distribution.

During training, the model predicts the noise ϵ_θ(x_t, t, S), conditioned on the structural prior S, and minimizes the mean square error with respect to the true noise ϵ:

L = 𝔼_{t, x_{0}, ε} [{‖ε - ε_{θ} (x_{t}, t, S)‖}^{2}]

(7)
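As a small numerical illustration of the objective in Equation (7), the sketch below computes the mean squared error between a true noise sample and a predicted one. It is a simplified stand-in rather than the training code of the paper: noise vectors are flat Python lists, the function name `mse_noise_loss` is hypothetical, and the error is averaged over pixels, which differs from the squared norm only by a constant factor.

```python
import random


def mse_noise_loss(eps_true, eps_pred):
    """Mean squared error between true and predicted noise (flat lists)."""
    if len(eps_true) != len(eps_pred):
        raise ValueError("noise vectors must have the same length")
    n = len(eps_true)
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / n


rng = random.Random(0)                        # fixed seed for repeatability
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]

perfect = mse_noise_loss(eps, eps)            # an ideal predictor gives zero loss
zero_guess = mse_noise_loss(eps, [0.0] * 16)  # predicting "no noise" is penalized
```

A network that always predicts zero noise pays the full energy of the noise sample, which is what drives it toward the true ϵ during training.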

3.3. Intrinsic Collaborative Mechanism of the Two-Stage Framework

The core of the collaborative mechanism of the proposed two-stage generation–refinement framework lies in the functional division, complementary advantages, and mutual constraints between the improved Pix2Pix module and the conditional diffusion module. This avoids the respective limitations of the two individual models and achieves enhanced synergistic performance. The specific collaborative logic is divided into three core dimensions:

(1): Training Stage: Data Augmentation and Difficulty Decoupling for Collaborative Optimization. First, the improved Pix2Pix module is trained on the original small-scale labeled paired dataset to learn the structural mapping from design sketches to pattern diagrams. After training is completed, the Pix2Pix model generates a large number of coarse-grained pattern samples with complete basic structures, which are combined with the original real patterns to form an augmented dataset for the diffusion model. This expands the training data volume from the original 990 samples to 3120 augmented samples, effectively mitigating the overfitting and unstable training issues of the diffusion model on small datasets. At the same time, this two-stage framework achieves a decoupling of learning difficulty: the Pix2Pix module is solely responsible for learning the core geometric structure and topological mapping, while the diffusion module is solely responsible for learning detail refinement and line optimization given the structural prior. This decoupling avoids the problem of a single model having to learn both global structure and local details simultaneously, which can lead to convergence difficulties and performance degradation, and significantly reduces the learning difficulty of each module.
(2): Inference Stage: Structural Prior Constraints and Detail Enhancement for Collaborative Generation. During the inference stage, the improved Pix2Pix first generates a coarse-grained pattern with a complete basic structure based on the input design sketch, and this pattern is used as the conditional input for the diffusion model. This coarse-grained pattern provides a strict structural prior and geometric constraints for the diffusion model, ensuring that the diffusion model does not deviate from the target structure during the denoising generation process, thus solving the core problem of structural distortion that occurs with independent diffusion models in small-sample scenarios.

Building on this, the conditional diffusion model takes the coarse-grained pattern as a condition and progressively optimizes the edge smoothness, line continuity, and local detail integrity of the pattern through an iterative denoising process, while strictly preserving the overall topology of the coarse-grained pattern. This process perfectly combines the structural stability of Pix2Pix with the detail generation capability of the diffusion model, achieving high-precision generation of industrial-grade patterns.

(3): Performance Complementarity: Addressing the Limitations of Single Models. For the standalone improved Pix2Pix model, although it maintains structural generation stability under small-sample conditions, it is limited by the adversarial learning mechanism of GANs. It is prone to issues such as edge blurring, local line breaks, and detail loss in complex structures like curved lines and cutting lines, failing to meet industrial production requirements. For the standalone conditional diffusion model, despite its excellent detail generation capability, it is susceptible to structural deformation, topological deviation, and unstable generation under small annotated datasets, rendering the generated patterns unsuitable for direct production use.

The two-stage collaborative framework perfectly addresses the aforementioned limitations: Pix2Pix provides stable structural support for the diffusion model, while the diffusion model compensates for Pix2Pix’s deficiencies in detail, achieving a balance between structural accuracy, detail quality, and generation stability.

4. Experiments

4.1. Dataset and Implementation Details

To train and evaluate the improved Pix2Pix and diffusion models for automatic clothing pattern generation, careful dataset preparation is essential. This study focuses on generating sleeve and back panel patterns for several clothing styles, including women’s shirts, men’s shirts, women’s jackets, and vests. We extracted 2D cross-sections of the front view from the 3D clothing data generated in [36] and obtained the contour information of the sleeve section, which was used as the conditional input for the model. The corresponding output is the pattern of the sleeve section. The sleeve samples include 400 examples of different styles, such as short sleeves, long sleeves, and bell sleeves. Figure 6 shows some sleeve types. The back panel data come from publicly available clothing design databases, clothing images from e-commerce websites, clothing patterns provided by professional patternmakers, and design-to-pattern comparison diagrams exported from the clothing design software Boke-CAD (http://www.bokcad.com/ (accessed on 6 April 2026)), totaling 590 drawings. Some back panel types are shown in Figure 7. Although the dataset contains only 990 samples in total (400 sleeves and 590 back panels), it was constructed to be high in quality and representative. The sleeve samples cover three typical styles: short sleeves (42%), long sleeves (35%), and bell sleeves (23%). The back panel samples include women’s shirts (30%), men’s shirts (25%), women’s jackets (25%), and vests (20%), reflecting a relatively balanced distribution across common garment categories.

To ensure data quality and effective model learning, all images are standardized before being fed into the model. The processing consists of the following steps:

Part Segmentation: A garment design drawing contains several parts, such as the back piece, sleeves, and collar. Because this study targets the sleeve and the back panel, these parts are first extracted from the design drawings and used as the input condition for pattern generation.

Size Adjustment: Adjust all designs and clothing patterns to a uniform resolution (256 × 256 pixels), which is sufficient to capture the key structural features of clothing patterns (such as contour lines and cutting lines) while ensuring the stability of model training. This helps to improve the consistency of model training and reduce learning bias caused by differences in image size.

Data Enhancement: Rotation, scaling, flipping, and brightness adjustment are applied to increase the diversity of the dataset and to improve the model’s robustness to the variations encountered when going from design drawings to patterns.

Data Division: The dataset is divided into training set, validation set and test set to ensure the effectiveness of model training and avoid overfitting. In this study, the training set accounts for 90% of the dataset and the test set accounts for 10%.
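The 90%/10% division can be illustrated with a short sketch. It assumes the split is applied to the 990-sample dataset by shuffling sample indices with a fixed seed; the shuffling, the seed, and the function name `split_indices` are assumptions made for this example, and the validation set mentioned above is not modeled because its proportion is not stated.

```python
import random


def split_indices(n_samples, train_fraction=0.9, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)     # seeded so the split is repeatable
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(990)     # 400 sleeves + 590 back panels
```

Under these assumptions the split yields 891 training samples and 99 test samples, with no index shared between the two sets.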

In training the improved Pix2Pix model, the Adam optimizer was employed with an initial learning rate of 1 × 10⁻⁴, β₁ = 0.5, and β₂ = 0.999. The batch size was set to 8 based on GPU memory constraints (NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM). Weight decay was set to 1 × 10⁻⁵ to prevent overfitting given the limited dataset size. Gradient clipping was applied with a maximum norm of 1.0 to stabilize training and prevent gradient explosion. The generator and discriminators of the improved Pix2Pix model were initialized using Xavier normal initialization [37], with bias terms initialized to zero. The U-Net encoder–decoder weights followed the same initialization scheme to ensure consistent gradient flow. All input designs and generated patterns are fixed to 256 × 256 pixels with values normalized to the [−1, 1] range. The model was trained for 200 epochs on the GPU.
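Two of the settings above, normalization to [−1, 1] and gradient clipping with a maximum norm of 1.0, can be written out as a minimal sketch. It assumes 8-bit pixel values in [0, 255] and treats the gradient as a flat list of floats; the function names are hypothetical, and deep learning frameworks provide their own equivalents of the clipping step.

```python
import math


def normalize_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)               # already within the limit
    scale = max_norm / total
    return [g * scale for g in grads]    # same direction, shorter length


clipped = clip_by_global_norm([3.0, 4.0])      # norm 5 is rescaled to norm 1
unchanged = clip_by_global_norm([0.3, 0.4])    # norm 0.5 is left as it is
```

Clipping by the global norm preserves the direction of the update and only limits its length, which is why it stabilizes training without biasing individual weights.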

After completing the training of the improved Pix2Pix model, we further trained the conditional diffusion model. The conditional diffusion model concatenates the initial pattern and the noisy image along the channel dimension to form a new input tensor, which is progressively denoised to generate a higher-quality pattern. The training dataset for the conditional diffusion model is derived from preliminary patterns generated by the improved Pix2Pix model across different training epochs. After data augmentation, 3120 augmented training samples are provided. The U-Net backbone network of the diffusion model is initialized using Kaiming normal initialization [38], which is a more suitable initialization method for models with a large number of residual connections. The specific implementation details are as follows.

Optimizer and learning rate: The training employed the Adam optimizer with a fixed learning rate of 1 × 10⁻⁴, β₁ = 0.9, and β₂ = 0.999. Due to the increased memory usage caused by multi-step noise prediction, the batch size was set to 4.

Timesteps and noise addition: The model is set up with 500 timesteps, and the noise is added gradually by linear beta scheduling during the forward process. The noise for each timestep is generated from a standard normal distribution and combined with the input image to generate a noisy image.

Introduction of conditional input: In the reverse denoising process, the model passes the initial clothing patterns as a conditional input to the neural network together with the noisy image. This conditional input helps the model to retain the structural information of the initial patterns, gradually strengthen this information in the denoising process, and finally generate a high-quality clothing pattern that meets the expectations.

Loss Function: During training, a mean square error (MSE) loss is used to measure the difference between the noise predicted by the model and the true noise, thus guiding the model to recover the original image more accurately.

Data preprocessing: All input data were scaled to single-channel grayscale images of 256 × 256 pixels, and data enhancement techniques (e.g., random horizontal flipping) were used to increase data diversity and improve the generalization ability of the model.
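The forward noising described in the implementation details can be sketched as follows. The 500 timesteps and the linear schedule come from the text; the endpoint values `beta_start=1e-4` and `beta_end=0.02` are assumptions (they are not stated here), and the closed-form sampling of x_t from x_0 is the standard identity for Gaussian diffusion rather than something specific to this paper.

```python
import math
import random


def linear_beta_schedule(num_steps=500, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances; the endpoint values are assumptions."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_s), often written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot; t is a 0-based step index."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]          # standard normal noise
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
x_t, eps = add_noise([0.5, -0.5, 1.0], 499, alpha_bars, random.Random(0))
```

With these assumed endpoints, the cumulative product at the last step is close to zero, so the final noisy image is dominated by the Gaussian noise and carries little of the original pattern.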

The diffusion model takes about 2 h to train. After 80 epochs of training, the model progressively generates fine patterns matching the input conditions through a sampling process, starting from random noise images and combining the initial pattern’s information. The reverse denoising process took an average of 0.08 s per time step (500 steps in total) and 40 s to generate a single complete image. In comparison, the improved Pix2Pix model takes only 1 s to generate a single pattern. Although more time-consuming to train and sample, this conditional diffusion model further improves the detail and structural integrity of the generated images and shows excellent performance in the clothing patterns generation task.

4.2. Analysis of Weighting Factors and Evaluation Indicators

To objectively evaluate the improved Pix2Pix in clothing pattern generation and the conditional diffusion model in detail restoration, we quantify the quality of the generated images with three metrics: SSIM, PSNR, and LPIPS. SSIM measures the structural similarity between the generated and real images; the closer the value is to 1, the closer the generated structure is to the real image and the higher the quality. In the experiments, SSIM is primarily computed on Canny edge-detected images to specifically evaluate the structural fidelity of contour lines, which are the most critical information in garment patterns. While this approach may not fully reflect the overall image quality, it provides a direct assessment of line structure preservation. To ensure comprehensiveness, we also report SSIM computed on original grayscale images for all experiments, and we observe consistent trends across both evaluation settings. We selected several models for comparison, and the test results are listed in Table 1. PSNR measures reconstruction quality from the pixel-wise difference between the generated image and the real image, with higher values indicating lower image noise and better quality. SSIM and PSNR are defined in Equations (8)–(10). Learned Perceptual Image Patch Similarity (LPIPS) [39] measures the perceptual similarity between images by computing the distance between deep features extracted from a pretrained neural network. Lower LPIPS values indicate greater perceptual similarity, aligning better with human visual judgment.

\mathrm{SSIM} (x, y) = \frac{(2 μ_{x} μ_{y} + C_{1}) (2 σ_{x y} + C_{2})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1}) (σ_{x}^{2} + σ_{y}^{2} + C_{2})}

(8)

\mathrm{MSE} = \frac{1}{m n} \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} {(I (i, j) - K (i, j))}^{2}

(9)

\mathrm{PSNR} = 10 \times \log_{10} (\frac{\mathrm{MAX}_{I}^{2}}{\mathrm{MSE}})

(10)

where m × n is the image size, I is the reference (real) image, K is its noisy approximation (here, the generated image), and MSE is the mean square error between them. MAX_I is the maximum possible pixel value of the image. x and y denote the two images being compared; μ_x and μ_y are their means, σ_x² and σ_y² are their variances, and σ_xy is their covariance. C₁ = (k₁L)² and C₂ = (k₂L)² are two constants that keep the division stable when the denominator is close to zero; L is the range of pixel values, which is 2^B − 1 for a B-bit image. PSNR quantifies the ratio between the maximum possible signal power of the image and the power of the noise (MSE).
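The definitions of MSE, PSNR, and SSIM can be checked on a toy example. The sketch below is a simplified illustration under several assumptions: images are nested lists of 8-bit values, SSIM is evaluated once on whole-image statistics (practical implementations usually average it over local windows), the denominator uses the standard product form, and k₁ = 0.01 and k₂ = 0.03 are the commonly used defaults rather than values stated in the text.

```python
import math


def mse(I, K):
    """Mean squared error between two m x n images."""
    m, n = len(I), len(I[0])
    return sum((I[i][j] - K[i][j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)


def psnr(I, K, max_i=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(I, K)
    return math.inf if err == 0 else 10.0 * math.log10(max_i ** 2 / err)


def global_ssim(x, y, L=255.0, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (no sliding window)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


real = [[0, 255], [255, 0]]
generated = [[0, 255], [255, 10]]    # one pixel deviates by 10 gray levels
error = mse(real, generated)         # 100 / 4 = 25.0
quality_db = psnr(real, generated)
similarity = global_ssim(real, generated)
```

An image compared with itself gives an SSIM of 1 and an unbounded PSNR, while any deviation lowers both, which matches the interpretation of the two metrics given above.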

To determine the optimal weight coefficients λ in the composite loss function, an exhaustive grid search was conducted over predefined ranges of λ values. The search ranges for each weight coefficient were determined based on preliminary experiments and common practices in the image-to-image translation literature [17]. The search space included λ₂ ∈ {0.5, 1.0, 1.5}, λ_VGG ∈ {0.5, 1.0, 1.5}, λ_Sobel ∈ {0.5, 1.0, 1.5}, and λ_IoU ∈ {0.3, 0.5, 0.7}, which yields 3 × 3 × 3 × 3 = 81 combinations of weight coefficients. For each combination, we train the model on the training set and evaluate the SSIM, PSNR, and LPIPS metrics on the validation set.
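The size of this search space is easy to verify by enumeration. In the sketch below the grid values are taken from the text, while `toy_score` is a hypothetical placeholder for "train the model and evaluate it on the validation set": it simply prefers the middle of each range and does not reflect any measured result. How the three metrics are combined into one ranking is not specified, so a single scalar score is also an assumption.

```python
from itertools import product

SEARCH_SPACE = {
    "lambda_2": (0.5, 1.0, 1.5),
    "lambda_vgg": (0.5, 1.0, 1.5),
    "lambda_sobel": (0.5, 1.0, 1.5),
    "lambda_iou": (0.3, 0.5, 0.7),
}


def enumerate_grid(space):
    """List every combination of weight coefficients as a dict."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


def grid_search(space, evaluate):
    """Return the configuration with the highest score under `evaluate`."""
    return max(enumerate_grid(space), key=evaluate)


def toy_score(cfg):
    """Placeholder score; a real run would train and validate the model."""
    centre = {"lambda_2": 1.0, "lambda_vgg": 1.0,
              "lambda_sobel": 1.0, "lambda_iou": 0.5}
    return -sum(abs(cfg[name] - centre[name]) for name in cfg)


configs = enumerate_grid(SEARCH_SPACE)   # 3 * 3 * 3 * 3 combinations
best = grid_search(SEARCH_SPACE, toy_score)
```

Since every configuration requires a full training run, the cost of the search grows with the product of the range sizes, which is why each range is kept to three values.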

The final configuration (λ₂ = 1.0, λ_VGG = 1.0, λ_Sobel = 1.0, λ_IoU = 0.5) was selected based on achieving the highest SSIM (0.82) and PSNR (22.31) while maintaining a low perceptual index. This process ensured a balanced contribution of each loss component to the overall optimization goal.

The grid search process revealed that the L2 loss (λ₂) and perceptual loss (λ_VGG) are critical for maintaining global structural consistency, while the Sobel edge loss (λ_Sobel) ensured accurate contour generation. The Block IoU loss (λ_IoU) needed careful adjustment. Higher values led to excessively strict penalties for local alignment, while lower values provided flexibility for complex pattern variations. This systematic approach to weight selection demonstrates the necessity of balancing diverse optimization objectives in multi-task learning scenarios.
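To make the strictness of a block-wise IoU penalty concrete, the sketch below computes one possible form of such a loss. It is an assumption-laden illustration, not the definition used in the paper: masks are binary nested lists, the image is split into non-overlapping square blocks, blocks that are empty in both masks count as a perfect match, and the loss is one minus the mean block IoU. The function name and the block size are hypothetical.

```python
def block_iou_loss(pred, target, block=2):
    """One minus the mean IoU over non-overlapping block x block tiles."""
    h, w = len(target), len(target[0])
    ious = []
    for top in range(0, h, block):
        for left in range(0, w, block):
            inter = union = 0
            for i in range(top, min(top + block, h)):
                for j in range(left, min(left + block, w)):
                    p, t = pred[i][j] > 0, target[i][j] > 0
                    inter += p and t      # pixel set in both masks
                    union += p or t       # pixel set in at least one mask
            # Tiles that are empty in both masks are treated as a match.
            ious.append(inter / union if union else 1.0)
    return 1.0 - sum(ious) / len(ious)


target = [[1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]
shifted = [[1, 1, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 1, 1]]    # lower line moved down by one pixel
loss = block_iou_loss(shifted, target)
```

In this toy case a one-pixel shift of a thin line drives the IoU of its block to zero, so the loss is 0.25 even though three of the four blocks match exactly. This is the kind of strict local penalty that a large λ_IoU would amplify.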

4.3. Experimental Results and Comparison with State-of-the-Art Methods

To comprehensively evaluate the performance advantages of the proposed method on clothing components of varying complexity, this study conducted experiments on relatively simple sleeve patterns and complex back panels. Quantitative and qualitative analyses of the back panels were performed using current mainstream methods to validate the effectiveness and generalization capabilities of the proposed method.

4.3.1. Generation of Pattern-Making for the Sleeve

As a basic component of clothing, sleeve patterns in the dataset mainly feature external contour lines without complex cutting lines or dart structures, making them ideal for verifying the basic generation capabilities of the model.

Figure 8 shows the results of the improved Pix2Pix model combined with diffusion model repair in the sleeve pattern generation task. From top to bottom, the rows show the design drawing of the sleeve outline, the actual sleeve pattern, the initial sleeve pattern generated by the improved Pix2Pix model, and the final sleeve pattern repaired by the diffusion model. The initial pattern generated by the improved Pix2Pix captures the basic contour structures, such as the curved line of the sleeve cap and the straight edge of the cuff, consistent with the actual pattern. However, there is slight blurring in some local lines (such as the transition line where the sleeve cap connects to the main body), and some details (such as the slight curvature at the bottom of the sleeve cap) have not been fully restored. After repair using the diffusion model, the contour lines of the final pattern become clearer and sharper, with previously blurred transition areas now smooth and continuous. The details of the curvature at the bottom of the sleeve opening align closely with the real pattern, and the overall structure is highly consistent with the real pattern.

This result indicates that the improved Pix2Pix model can effectively learn the basic structural mapping from design sketches to patterns, while the diffusion model’s restoration process further optimizes local details, significantly enhancing the generation quality of sleeve patterns, thereby validating the model’s effectiveness in basic component generation tasks.

4.3.2. Generation of Pattern-Making for the Back Panels

As a core component of clothing, the back panel contains complex structures such as cutting lines, darts, and back center lines, which place higher demands on the model’s ability to capture detail and to keep the global structure consistent. It is therefore the key test case for verifying the robustness of the method.

Figure 9 shows, from top to bottom, the input images, the real patterns, and the results of the improved Pix2Pix, the DPM repair model, ControlNet [40], GAN, FUNIT [41], and Pix2Pix on selected validation samples. ControlNet is an image generation control network that precisely controls the shape and details of the generated results by introducing additional conditional information, and it can be flexibly applied to a variety of image generation tasks. GAN generates highly realistic images through an adversarial learning mechanism and is widely applicable to image generation, restoration, and style transfer. FUNIT (Few-shot Unsupervised Image-to-Image Translation) is an image translation framework that can achieve effective image generation using a small number of training samples, supports diverse transformations across image categories, and is suitable for scenarios with scarce data. While recent garment-specific models exist (e.g., SewingLDM, GarmentDiffusion), they are designed for substantially different tasks (3D reconstruction, text-to-pattern) and require data scales incompatible with our setting, making direct comparison less meaningful. We have qualitatively discussed these methods in Section 2.4.

The results show that, among the baseline generative models, Pix2Pix performs best on this task. The improved Pix2Pix is closer to the real pattern, although a few outputs still contain blurred or incomplete lines. After DPM repair, the outputs show continuous lines and clearer images. In contrast, ControlNet, GAN, and FUNIT perform worse: the generated contour lines are blurred, incomplete, or even incorrect. For generating garment patterns from design drawings, the improved Pix2Pix and the DPM repair therefore show clear advantages in image generation and detail recovery, respectively.

The evaluation results are shown in Table 1. The SSIM, PSNR, and LPIPS of the proposed method are 0.869, 22.31, and 0.1318, respectively, the best accuracy and clarity among the compared Pix2Pix, ControlNet, GAN, and FUNIT models. This again indicates that the improved Pix2Pix and the DPM repair are effective for image generation and detail recovery, respectively.

From the perspective of industrial deployment, the back panel, as the core component of garments, contains complex structures such as darts, cutting lines, and back center lines, and is the most time-consuming and experience-dependent part of manual pattern-making (taking 1–2 h for a single style on average). The 11.4% SSIM improvement of our method compared with the original Pix2Pix ensures that the position deviation of darts and cutting lines is controlled within the tolerance range of industrial production, and the generated pattern has a complete structure and clear continuous lines. In contrast, patterns generated by ControlNet, GAN, and FUNIT have problems such as blurred contours, broken lines, and structural dislocation, which require extensive manual correction and cannot be directly used for production. The patterns generated by our method can be directly imported into mainstream garment CAD software (such as Boke-CAD) for subsequent layout and cutting, realizing end-to-end automation from design drawings to industrial production patterns, which greatly reduces the labor cost of pattern-making and the dependence on skilled workers.

4.4. Ablation Study

To validate the necessity of each component in our framework, we conduct ablation studies by removing specific loss functions or modules. The experiments are divided into two parts, loss function ablation and module ablation, as shown in Table 2 and Table 3 and Figure 10 and Figure 11.

4.4.1. Loss Function Ablation

We systematically evaluated the contributions of individual loss components in the improved Pix2Pix model. By sequentially removing single loss terms and retraining the model, the impacts on generation quality were quantified.

Removal of the Block IoU loss reduces the SSIM from 0.780 to 0.744. The Block IoU loss improves contour alignment accuracy for critical regions (e.g., sleeve holes, collars, silhouettes) via local region constraint. Its absence degrades local details, such as broken stitching at the back panel hem.

Removal of the Sobel edge loss causes a sharp drop in PSNR from 20.20 to 18.98. The Sobel edge loss enforces edge matching to ensure structural line continuity (e.g., back midline, shoulder line), and its removal causes significant edge blurring.

Removal of the VGG perceptual loss reduces the SSIM to 0.698 and degrades the LPIPS to 0.2344 due to the lack of high-level semantic feature alignment. Its absence triggers semantic errors, leading to semantic deviations of the generated pattern from the real sample.

With the L₂ loss removed, pixel-level deviations accumulate, introducing noise and global shape distortions.

The composite loss function achieves synergistic enhancement across global structure (L₂, L_VGG), local details (L_IoU), and edge precision (L_Sobel) through multi-scale optimization. This design effectively balances pixel-level reconstruction with visual perception, providing a robust optimization framework for complex garment pattern generation.
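As an illustration of the edge term, the sketch below compares Sobel gradient magnitudes of a generated and a real image. It is one plausible form of a Sobel edge loss under stated assumptions, not the paper's exact definition: the images are small nested lists, border pixels are skipped, and the mean absolute difference between the two edge maps is used as the penalty. The function names are hypothetical.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Gradient magnitude on interior pixels (the 1-pixel border is skipped)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def sobel_edge_loss(pred, target):
    """Mean absolute difference between the two Sobel edge maps."""
    ep, et = sobel_magnitude(pred), sobel_magnitude(target)
    diffs = [abs(a - b) for rp, rt in zip(ep, et) for a, b in zip(rp, rt)]
    return sum(diffs) / len(diffs)


sharp = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]       # crisp vertical edge
blurred = [[0.0, 0.25, 0.75, 1.0] for _ in range(4)]   # same edge, smeared
edge_loss = sobel_edge_loss(blurred, sharp)
```

The smeared edge has a weaker gradient than the crisp one, so the loss is positive even though both images describe the same line position; this is the behavior that discourages edge blurring.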

4.4.2. Module Ablation

To validate the effectiveness of each module in the model, this study sequentially removed the multi-scale discriminator, the self-attention mechanism in the generator, and the diffusion model repair module. The impacts were analyzed through quantitative metrics and visual comparisons.

Multi-scale discriminator removal: Eliminating the multi-scale discriminator causes edge blurring and reduced texture fidelity, and aggravates structural deformation in contour and cutting line regions.

Self-Attention Mechanism Removal: Removal of the self-attention mechanism causes misalignment in complex structural intersections (e.g., back yoke and side seam joints), with the PSNR dropping to 19.33 and the SSIM to 0.75 due to the lack of topological coherence preservation via long-range dependency modeling.

DPM Repair Module Removal: Without diffusion-based refinement, the baseline Pix2Pix output exhibits two key defects: localized artifacts such as jagged hems, and inconsistent stitch densities across different pattern blocks. The important role of DPM in perceptual quality enhancement through probabilistic denoising is confirmed not only visually but also in quantitative metrics.

The hierarchical architecture achieves synergistic advantages through complementary functional integration. The improved Pix2Pix framework leverages multi-scale adversarial learning to establish robust structural priors, ensuring fundamental geometric accuracy in key pattern components such as armholes and silhouettes; concurrently, the diffusion probabilistic models (DPMs) operate under these structural constraints to refine stochastic texture details, effectively resolving edge irregularities and material-specific stitch patterns through iterative denoising. Crucially, the embedded self-attention mechanism dynamically coordinates global–local feature interactions, enabling precise parameterization of pattern dimensions while maintaining proportional consistency across spatially distant regions (e.g., collar-to-hem relationships). This modular co-design directly addresses three critical limitations of standalone GANs: structural instability in complex topologies, texture detail degradation under high-resolution requirements, and inadequate long-range spatial correlations. It thereby establishes an industrially viable framework for automated precision pattern generation in mass apparel production.

4.4.3. Discriminator Architecture Ablation

To quantify the effectiveness of the proposed multi-scale discriminator, we conducted an ablation study comparing three configurations: (a) global discriminator only (removing the local PatchGAN discriminator D1), (b) local discriminator only (removing the global discriminator D2), and (c) multi-scale discriminator (D1 + D2, our full model). All other hyperparameters and training settings remain identical.

As shown in Table 4, the multi-scale discriminator consistently outperforms both single-discriminator variants across all three metrics. Specifically, compared with the global-only configuration, the multi-scale discriminator improves SSIM from 0.762 to 0.780, PSNR from 19.52 to 20.20, and reduces LPIPS from 0.1919 to 0.1710. Compared with the local-only configuration, the improvements are even more pronounced, particularly in SSIM (0.745→0.780), indicating that the global discriminator plays a crucial role in maintaining overall structural coherence.
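The size of these gains is easier to compare as relative changes. The sketch below recomputes them from the values quoted in the paragraph above; the helper name `relative_change` is hypothetical, and only the numbers stated in the text are used.

```python
def relative_change(baseline, value):
    """Percentage change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline * 100.0


# (single-discriminator variant, multi-scale) pairs quoted in the text.
reported = {
    "SSIM vs global-only": (0.762, 0.780),
    "PSNR vs global-only": (19.52, 20.20),
    "LPIPS vs global-only": (0.1919, 0.1710),
    "SSIM vs local-only": (0.745, 0.780),
}

changes = {name: round(relative_change(base, full), 1)
           for name, (base, full) in reported.items()}
```

Under this calculation, the multi-scale design raises SSIM by about 2.4% and PSNR by about 3.5% over the global-only variant and lowers LPIPS by about 10.9%, while the SSIM gain over the local-only variant is about 4.7%.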

These results demonstrate that the combination of local and global discriminators provides complementary advantages: the local discriminator (PatchGAN) enforces fine-grained texture and edge realism, while the global discriminator ensures global shape and semantic consistency. Removing either leads to a noticeable degradation in either detail quality or structural accuracy, validating the necessity of the multi-scale design in our framework.

5. Conclusions

This paper proposes an automatic clothing pattern generation framework that combines the improved Pix2Pix model with the conditional diffusion model. This framework enhances the generation accuracy and quality of clothing patterns by optimizing the network architecture, designing a composite loss function, and constructing a two-stage generation–repair process. The methods described in this paper have the following contributions:

(1): Improving the structure capturing ability of Pix2Pix: By designing a multi-scale discriminator (fusing local PatchGAN and global discriminator) and a composite structure-aware loss function, it balances global structure consistency and local detail accuracy, overcoming the limitations of traditional Pix2Pix in expressing structural differences in clothing patterns.
(2): Two-stage generation–repair framework for data scarcity: Proposing a hybrid architecture combining a GAN and diffusion model, the improved Pix2Pix establishes a rough structural correspondence under limited supervision, providing enhanced training samples for the diffusion model; the diffusion model optimizes details based on these structures, alleviating the instability of the diffusion model in small datasets, and achieving a collaborative improvement in structural integrity and detail quality.
(3): Progressive verification logic for multiple complexity components: Verifying the model performance from sleeves (a simple component containing only contour lines) to the back piece (a complex component containing cutting lines and pleats). This logic not only validates the basic generation ability of the framework but also confirms its robustness in handling complex structures, fully demonstrating its adaptability to different complexity components. Compared with existing methods (such as ControlNet, GAN, FUNIT), this framework performs exceptionally well in the clothing pattern generation task, with SSIM reaching 0.869, PSNR reaching 22.31, and LPIPS reaching 0.1318, while ensuring the accuracy and clarity of the generated images, confirming its practicality and effectiveness in automated clothing pattern production.

Despite these advancements, this method still has some limitations: due to the small size of the dataset, its generalization ability on rare asymmetric and wrinkled designs is limited, and its coverage of other complex components, such as the front panel, is not comprehensive. In addition to these technical limitations, applying the proposed framework to industrial production lines also faces some practical challenges. Firstly, data collection and standardization are not easy, as the framework requires paired and pixel-aligned data, while the industrial environment involves various formats (such as CAD files and hand-drawn sketches) of uneven quality. Secondly, integrating with existing CAD/CAM workflows is a challenge, as industrial systems typically rely on parametric representations (such as B-splines) rather than bitmap images, so additional post-processing is required to achieve interoperability. Thirdly, manufacturing constraints such as seam allowances and texture line alignment are not explicitly encoded in the current framework, meaning that the output may be visually reasonable but not suitable for production use. Addressing these challenges through domain-adaptive data pipelines, geometric constraint modeling, and efficient inference will be the key direction of future work.

Author Contributions

Conceptualization, X.Z. and X.L.; Methodology, X.L.; Software, X.L.; Validation, X.L.; Formal Analysis, X.L.; Investigation, X.L.; Resources, X.L.; Data Curation, X.L.; Writing—Original Draft Preparation, X.L.; Writing—Review and Editing, X.L. and X.Z.; Visualization, X.L.; Supervision, B.L.; Project Administration, B.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Bing Liu and Bingshun Xu were employed by the company Hangzhou Zhongfu Technology & Innovation Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GAN	Generative Adversarial Network
DPMs	Diffusion Probabilistic Models
SSIM	Structural Similarity Index
PSNR	Peak Signal-to-Noise Ratio
LPIPS	Learned Perceptual Image Patch Similarity
cGANs	Conditional Generative Adversarial Networks
MSE	Mean Square Error
IoU	Intersection over Union

Figure 1. (a) Traditional clothing manufacturing process. (b) Process for generating pattern sheets for sleeve parts using 3D clothing data. The clothing production process involves two methods: the traditional process and the process using 3D data.

Figure 2. Overall architecture of the improved Pix2Pix.

Figure 3. Generator architecture diagram.

Figure 4. Schematic of Block IoU loss.

Figure 5. Model diagram of DPM-based pattern restoration.

Figure 6. Examples of sleeve types. From top to bottom: 2D front view mapping of sleeves extracted from 3D clothing data, left sleeve outline, and corresponding sleeve pattern. The sleeve types shown include short sleeves, long sleeves, bell sleeves, etc.

Figure 7. The typical styles in the dataset, from left to right: women’s blazer, curved-hem shirt, shirt, women’s short-sleeve shirt, men’s blazer, suit, hoodie, and T-shirt. From top to bottom: clothing design drawing, corresponding fabric pieces, and pattern drawings.

Figure 8. Model generation results for the sleeve component.

Figure 9. Model generation results and comparison results for the back panels.

Figure 10. Visual comparison of ablation study on loss functions.

Figure 11. Impact of removing key modules.

Table 1. Quantitative comparison of the proposed method with state-of-the-art methods.

Category	Method	SSIM (Original Images)	SSIM (Canny Edge Images)	PSNR	LPIPS
Existing Methods	ControlNet	0.765	0.788	19.58	0.2397
Existing Methods	GAN	0.771	0.787	18.00	0.2622
Existing Methods	FUNIT	0.770	0.750	19.00	0.3223
Existing Methods	Pix2Pix	0.745	0.780	19.25	0.2013
Proposed Method	Improved Pix2Pix	0.764	0.780	20.20	0.1710
Proposed Method	DPMs repaired	0.782	0.869	22.31	0.1318
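For readers less familiar with the PSNR column, the metric follows directly from the mean square error (MSE) between a generated pattern and its ground truth. The sketch below is a minimal pure-Python illustration, assuming 8-bit grayscale images stored as nested lists with a peak value of 255. The function name `psnr` and this data layout are assumptions for illustration, not the authors' evaluation code.

```python
import math


def psnr(reference, generated, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized grayscale images.

    Images are nested lists of pixel intensities (an assumed layout).
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same height")
    squared_error = 0.0
    count = 0
    for ref_row, gen_row in zip(reference, generated):
        if len(ref_row) != len(gen_row):
            raise ValueError("images must have the same width")
        for r, g in zip(ref_row, gen_row):
            squared_error += (r - g) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy 2x2 example: every pixel is off by 10 intensity levels, so MSE = 100.
reference = [[0, 50], [100, 200]]
generated = [[10, 60], [110, 210]]
print(round(psnr(reference, generated), 2))
```

Because PSNR is logarithmic in MSE, higher values mean smaller pixel-wise error, which is the direction of improvement reported in Table 1.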

Table 2. Quantitative results of loss ablation.

Experimental Condition	SSIM	PSNR	LPIPS
Remove $L_{IoU}$	0.744	19.47	0.1992
Remove $L_{Sobel}$	0.738	18.98	0.2121
Remove $L_{VGG}$	0.698	18.79	0.2344
Remove $L_{2}$	0.730	18.96	0.2033
Improved Pix2Pix (full loss)	0.780	20.20	0.1710
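The ablation in Table 2 is easier to read as differences from the full-loss model. The short sketch below computes those differences from the values reported in the table. The dictionary keys and the helper name `degradation` are illustrative choices, and the sign convention (positive means the ablated model is worse) is an assumption made here for readability.

```python
# Values copied from Table 2.
full = {"SSIM": 0.780, "PSNR": 20.20, "LPIPS": 0.1710}
ablations = {
    "L_IoU": {"SSIM": 0.744, "PSNR": 19.47, "LPIPS": 0.1992},
    "L_Sobel": {"SSIM": 0.738, "PSNR": 18.98, "LPIPS": 0.2121},
    "L_VGG": {"SSIM": 0.698, "PSNR": 18.79, "LPIPS": 0.2344},
    "L_2": {"SSIM": 0.730, "PSNR": 18.96, "LPIPS": 0.2033},
}


def degradation(removed_term):
    """Return how much worse each metric gets when one loss term is removed.

    SSIM and PSNR are higher-is-better, LPIPS is lower-is-better, so the
    LPIPS difference is taken in the opposite direction.
    """
    row = ablations[removed_term]
    return {
        "SSIM": round(full["SSIM"] - row["SSIM"], 4),
        "PSNR": round(full["PSNR"] - row["PSNR"], 4),
        "LPIPS": round(row["LPIPS"] - full["LPIPS"], 4),
    }


for term in ablations:
    print(term, degradation(term))

# Loss term whose removal costs the most SSIM.
most_important = max(ablations, key=lambda term: degradation(term)["SSIM"])
print("largest SSIM drop:", most_important)
```

On the reported numbers, removing the VGG perceptual term produces the largest drop in all three metrics.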

Table 3. Quantitative results of modular ablation.

Method	SSIM	PSNR	LPIPS
Remove multi-scale discriminator	0.783	20.00	0.1968
Remove self-attention mechanism	0.750	19.33	0.1844
Remove diffusion model repair (DPMs)	0.780	20.20	0.1710

Table 4. Discriminator architecture ablation.

Method	SSIM	PSNR	LPIPS
Local only (D1)	0.745	19.38	0.2043
Global only (D2)	0.762	19.52	0.1919
Multi-scale (D1 + D2)	0.780	20.20	0.1710
