Readily Design and Try-On Garments by Manipulating Segmentation Images

Abstract: The fashion industry has recently adopted artificial intelligence to provide new services, and research combining fashion design and artificial intelligence is ongoing. In particular, generative adversarial networks, which synthesize realistic-looking images, have been widely applied in the fashion industry. In this paper, a new apparel image is created using a generative model that can apply a new style to a desired area of a segmented image. A new fashion image can also be created by manipulating the segmentation image itself. This enables interactive fashion image manipulation, in which users edit images by controlling segmentation images. People can try new styles without the inconvenience of traveling to a store or changing clothes. They can also easily determine which colors and patterns suit the clothes they are wearing, or whether clothes worn by other people match their own. User-centered fashion design therefore becomes possible, and the approach is useful for virtual try-on and for recommending clothes.


Introduction
Fashion research using artificial intelligence can be divided into four main categories: Detection, recommendation, analysis, and synthesis. Detection is the most basic task: recognizing where the clothing region is in an image. It makes it possible to search with a picture of a garment and find the same item or items of a similar style. When garments are difficult to describe in words, such an image-based search is especially useful. Fashion recommendation is the problem of understanding clothing and learning the compatibility between fashion items. It includes recommending outfits that reflect the user's taste and suggesting goods that fit the current style. Fashion analysis studies the characteristics of outfits, the latest trends, and people's styles. It has potential in the fashion industry, primarily in marketing. Lastly, fashion synthesis involves creating an image that reflects changes in style or pose. In the field of fashion design in particular, research based on generative models such as generative adversarial networks (GAN) [1] is actively being conducted. Studies applying such models include a model that generates clothes from a text description [2], a model that dresses a person when images of the person and the clothes are given [3], a model that applies clothes worn by others [4], and a model that generates clothes with a given pattern [5].
In this work, we introduce a novel approach that can transform clothes into the desired style and shape. In our approach, a person (P) can try on clothes worn by others (A, B). It differs from other virtual try-on networks in two ways. First, the pattern and shape of the clothing can be manipulated freely. For example, P can wear not only A's whole outfit but also a combination such as A's T-shirt and B's pants. P can change the shape of the clothing, such as the sleeve length and neckline. It is also possible to change one kind of garment into another, such as pants into a skirt. Second, we can generate an image of the model in a different pose.
The main research contributions of this work are: (a) Additional images of people wearing colorful or patterned clothing were collected for better results. (b) Semantic region-adaptive normalization (SEAN) [6], a style transfer GAN, was modified and applied to transform clothes into the desired styles. Style transfer changes the current style into another style; in this paper, it converts the desired area of the semantic segmentation map to another style. (c) An image correction step was added to make the generated image look more realistic.

Fashion Parsing
Image analysis can be largely divided into three categories: Classification, object detection, and segmentation. Classification finds classes such as 'shirt', 'pants', and 'dress' in the image. Object detection indicates the position of an object within the image. Unlike classification or detection, segmentation divides all objects in the image into semantic units (Figure 1). The goal is to create a segmentation map by assigning every pixel to a specified class. For example, the pixels belonging to the pants are colored green, and the pixels belonging to the dress are colored blue. It is a high-level task that requires a complete understanding of the image because the class of each pixel must be predicted. A typical model is the fully convolutional network (FCN) [7]. It uses convolutional layers instead of the fully connected layers of a general classification convolutional neural network (CNN). The class of each pixel is therefore predicted while the pixel's positional information is retained, and the reduced image size is then restored through up-sampling. Fashion parsing uses segmentation to classify apparel such as tops, pants, and dresses to obtain the desired information. Graphonomy [8] showed good fashion parsing performance by introducing a hierarchical graph. For example, the head area can be divided into hat, face, and hair. The head contains the face, and the face is next to the hair. By exploiting this hierarchical relationship, a single model can perform human parsing across different domains and label granularities. In this paper, Graphonomy was used in the preprocessing step to obtain more accurate segmentation maps than other segmentation networks provide.
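As a minimal illustration of the segmentation map described above, the following sketch turns per-pixel class scores into a label map and then into a colored segmentation image. The four-class list, the palette, and the function names are hypothetical and chosen only for this example; they are not the labels or colors used by FCN or Graphonomy.

```python
import numpy as np

# Hypothetical classes and colors, for illustration only.
CLASSES = ["background", "top", "pants", "dress"]
PALETTE = np.array(
    [
        [0, 0, 0],      # background: black
        [255, 0, 0],    # top: red
        [0, 255, 0],    # pants: green
        [0, 0, 255],    # dress: blue
    ],
    dtype=np.uint8,
)


def scores_to_label_map(scores):
    """Pick the highest-scoring class per pixel.

    scores: array of shape (num_classes, H, W).
    Returns an integer label map of shape (H, W).
    """
    return np.argmax(scores, axis=0)


def colorize(label_map):
    """Replace each class index by its palette color -> (H, W, 3) uint8."""
    return PALETTE[label_map]


# Toy 2 x 2 image: top row scored as pants, bottom row as dress.
scores = np.zeros((len(CLASSES), 2, 2))
scores[2, 0, :] = 1.0
scores[3, 1, :] = 1.0

label_map = scores_to_label_map(scores)
seg_image = colorize(label_map)
```

In a real parser, `scores` would be the network output after up-sampling to the input resolution.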

Segmentation Image to Real Image
A standard GAN offers no control over what kind of data it generates. To address this, the conditional GAN was developed to generate data based on additional information y. In other words, the condition y is supplied to both the generator G and the discriminator D of the GAN, so that data can be generated in the desired way [9].
The condition y can be a label, text, or an image. If the condition is a label, images of that class are created. For example, a pants image is created when the label is pants, and a T-shirt image when it is T-shirt. If the condition is text, an image corresponding to the text is created. For example, given the text 'blue pants', an image of blue pants is created. When the condition is an image, the task is also called image-to-image translation, which learns the relationship between the input image and the result image. One image-to-image translation method, Pix2Pix [10], converts an image into an image of another style. Using this, segmentation images can be synthesized into realistic images. Pix2PixHD [11] extends Pix2Pix to high-definition image synthesis. GauGAN [12], proposed by Nvidia, produces a realistic image within seconds from a segmentation image edited with specified labels. It can also synthesize the output in a chosen style. Figure 2 shows results of image-to-image translation.
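For reference, the conditional GAN objective is commonly written as the two-player minimax game below, in which both G and D receive the condition y. The symbols p_data (data distribution) and p_z (noise prior) follow the usual GAN notation and are assumed here; they do not appear elsewhere in this paper.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x \mid y)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z \mid y))\bigr)\bigr]
```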

Materials and Methods
Given a person image P and a model image M, we propose a costume design system that generates a person image P' wearing M-style outfits. It is possible to partially change the outfits, and new forms of clothing can be created through segmentation modification. This system is divided into three stages: Generating style code, applying modified SEAN, and image correction.

Dataset
Images were collected from the publicly available DeepFashion dataset [13] and through web crawling. DeepFashion is a large-scale garment database that provides images and annotation data for clothing-related tasks such as clothing detection, landmark prediction, clothing segmentation, and search. In the experiment, the In-shop Clothes Retrieval Benchmark was used among the datasets provided by DeepFashion. The In-shop Clothes Retrieval Benchmark is divided into men's and women's clothing, and each is categorized by the type of garment (T-shirt, pants, jacket, etc.). Each category contains images of people wearing that kind of apparel in various poses and from different angles. In this paper, only full-body and upper-body images of clothed people were used for better learning results. However, the DeepFashion dataset lacks diversity in the patterns and colors of clothes, which are important factors in fashion design. We therefore collected additional images through web crawling to complement it. We collected images of people wearing patterned clothing, such as stripes, dots, checks, flowers, tie-dye, leopard print, and camouflage, and of people wearing clothing of various colors. For consistency, only full-body and upper-body images were added to the dataset. The experiment used 53,000 images in total, 45,000 from DeepFashion and 8,000 additional images, divided into 48,000 training images and 5,000 test images. The size of the images was 256 × 256.

Generate Style Code
Style code generation extracts a style code for each semantic region so that images can be synthesized region by region according to the segmentation map (Figure 4). A pre-trained Graphonomy network was used to obtain segmentation images. Given a person image P, the network outputs the class label of each pixel. Assigning a color to each class allows the result to be visualized, as shown in Figure 3. In the experiment, it produces segmentation images P_seg with 20 classes (background, hat, hair, gloves, sunglasses, top clothes, dress, coat, socks, pants, torso, scarf, skirt, face, left arm, right arm, left leg, right leg, left shoe, right shoe). The style encoder takes P and P_seg as inputs and creates a (512 × 20)-dimensional style matrix ST. As there are 20 classes, ST has 20 columns, and each column is the style code of the corresponding region. The column of a class that does not appear in the image is set to 0. The style encoder is trained to filter out region-specific style codes from the input image according to the corresponding segmentation mask. A style map is then created by applying a convolution to each style code and broadcasting the result to the corresponding region according to the segmentation mask (Figure 4).
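The construction of the style matrix ST and the style map can be sketched as follows. This is a simplified illustration, not the trained style encoder: it assumes that the region-specific code is obtained by averaging an encoder feature map over each region (region-wise average pooling, as we understand SEAN [6] to do), it replaces the encoder output with random features, and it omits the per-style convolution before broadcasting. The function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 20   # number of segmentation classes in P_seg
STYLE_DIM = 512    # length of one style code


def region_style_codes(features, seg, num_classes=NUM_CLASSES):
    """Average a feature map over each semantic region.

    features: (C, H, W) feature map (stand-in for the encoder output).
    seg:      (H, W) integer label map.
    Returns ST of shape (C, num_classes). Columns of classes that do not
    occur in `seg` stay zero.
    """
    st = np.zeros((features.shape[0], num_classes), dtype=features.dtype)
    for k in range(num_classes):
        mask = seg == k
        if mask.any():
            # features[:, mask] has shape (C, number_of_pixels_in_region)
            st[:, k] = features[:, mask].mean(axis=1)
    return st


def broadcast_style_map(st, seg):
    """Place each region's style code at every pixel of that region.

    Returns a style map of shape (C, H, W).
    """
    return st[:, seg]


# Toy example: a 4 x 4 image containing only classes 0 and 5.
rng = np.random.default_rng(0)
features = rng.standard_normal((STYLE_DIM, 4, 4))
seg = np.zeros((4, 4), dtype=int)
seg[2:, :] = 5

ST = region_style_codes(features, seg)
style_map = broadcast_style_map(ST, seg)
```

Swapping a column of ST for the corresponding column computed from another image is what changes the style of a single region.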

SEAN
In deep learning, normalization layers standardize the outputs of intermediate layers. Constraining the distribution of the activations stabilizes training, but much information is lost in the process. GauGAN introduced spatially adaptive normalization (SPADE), which preserves spatial information and produces better results. However, because SPADE uses a single style code for the whole image, the image can only be changed into one style and cannot be modified region by region. Semantic region-adaptive normalization (SEAN) was therefore proposed in [6] so that each region can be controlled individually. As the image is generated under the condition of the segmentation region to be changed, only that region is converted to the desired style. Using a simple convolutional network, γ^s (scale) and β^s (shift) are obtained for each pixel from the style map. At the same time, γ^o and β^o are computed from the segmentation mask in the same way as in SPADE. The two sets of parameters are combined as a weighted sum. The γ and β values and the weights are all learned in SEAN. As γ and β are not scalars but tensors with spatial dimensions, they have the advantage of not losing spatial information. This can be expressed as follows.
ST is the style matrix, SM is the segmentation mask, N is the batch size, C is the number of channels, and H and W are the height and width of the activation map, respectively (n ∈ N, c ∈ C, y ∈ H, x ∈ W). µ_c and σ_c are the mean and standard deviation of the activation in channel c, respectively. γ_{c,y,x} and β_{c,y,x} are the weighted sums of γ^s_{c,y,x} and γ^o_{c,y,x}, and of β^s_{c,y,x} and β^o_{c,y,x}, respectively (Equation (2)). The activation h_{n,c,y,x} is normalized with the mean µ_c and standard deviation σ_c of channel c, and then denormalized with γ_{c,y,x} and β_{c,y,x}, which are computed from the segmentation mask SM and the style matrix ST. Figure 5 shows the structure of semantic region-adaptive normalization.
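Using the symbols defined above, the normalization and the weighted sum can be written as below. This is a reconstruction based on those definitions and on the SEAN formulation in [6]; the names α_γ and α_β for the learned mixing weights are assumed notation.

```latex
% Equation (1): normalization followed by region-adaptive denormalization
\gamma_{c,y,x}(ST, SM)\,\frac{h_{n,c,y,x} - \mu_c}{\sigma_c} + \beta_{c,y,x}(ST, SM)

% Channel statistics over the batch and spatial dimensions
\mu_c = \frac{1}{NHW} \sum_{n,y,x} h_{n,c,y,x},
\qquad
\sigma_c = \sqrt{\frac{1}{NHW} \sum_{n,y,x} h_{n,c,y,x}^{2} - \mu_c^{2}}

% Equation (2): weighted sum of the style-based and mask-based parameters
\gamma_{c,y,x}(ST, SM) = \alpha_\gamma\,\gamma^{s}_{c,y,x}(ST) + (1 - \alpha_\gamma)\,\gamma^{o}_{c,y,x}(SM)
\beta_{c,y,x}(ST, SM) = \alpha_\beta\,\beta^{s}_{c,y,x}(ST) + (1 - \alpha_\beta)\,\beta^{o}_{c,y,x}(SM)
```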

Modified SEAN
The SEAN block takes regional style codes, the segmentation mask, and noise as input. In this paper, an additional network is added after the SEAN block to improve performance. ResNet [14] improves performance by mitigating the vanishing gradient problem. The unit of SEAN with the ResNet structure is called the SEAN ResNet block (ResBlK). The structure of SEAN ResBlK is shown in Figure 6. SENet [15] consists of a squeeze step and an excitation step. Adding it to a model introduces few additional parameters, so model complexity and computation do not increase significantly, while the performance gain is considerable. By adding a SENet-like network before the up-sampling step, the output features of SEAN ResBlK can be reweighted according to the importance of each channel. It also improves performance by making the model converge faster and the synthesized image look smoother. We call this step the SENet block (SEBlK). The structure of SEBlK is shown in Figure 7.
The squeeze step converts each of the C feature maps of size H × W into a feature map of size 1 × 1. One value is obtained per channel by averaging its two-dimensional feature map. That is, the H × W × C feature map is converted into a 1 × 1 × C feature map. The excitation step estimates the relative importance of each channel by passing the squeezed feature map through convolution layers and activation functions. The relative importance of each channel is expressed as a value between 0 and 1, which is multiplied by the input feature map x in the scale step to obtain the rescaled feature map x̃.
The network is thus trained to take the mask, style codes, and noise as input to SEAN and to reconstruct the input image through SEBlK and up-sampling. The image generation process is shown in Figure 8. After training, the image can be modified by changing the style image and the segmentation mask.
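The squeeze, excitation, and scale steps can be sketched as follows. This is a generic squeeze-and-excitation block in the spirit of SENet [15], not the exact SEBlK of this paper: the two-layer excitation with ReLU and sigmoid, the reduction ratio `r`, and the random weights are assumptions made for illustration. On a 1 × 1 feature map, a 1 × 1 convolution is equivalent to the matrix product used here.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on a single feature map.

    x:  (C, H, W) input feature map.
    w1: (C // r, C) and b1: (C // r,)  first excitation layer.
    w2: (C, C // r) and b2: (C,)       second excitation layer.
    Returns the rescaled feature map and the channel weights.
    """
    # Squeeze: global average pooling, one value per channel.
    z = x.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck with ReLU, then sigmoid gives values in (0, 1).
    hidden = np.maximum(w1 @ z + b1, 0.0)    # shape (C // r,)
    s = sigmoid(w2 @ hidden + b2)            # shape (C,)
    # Scale: multiply every channel of x by its importance weight.
    x_tilde = x * s[:, None, None]
    return x_tilde, s


# Toy example with C = 8 channels and an assumed reduction ratio r = 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 16, 16))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)

x_tilde, s = se_block(x, w1, b1, w2, b2)
```

Because the block only adds the two small weight matrices, the parameter count grows by roughly 2C²/r per block, which is why the added cost is small.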

Generated Image Correction
Because the face region is synthesized along with the rest of the image, the face in the generated image is unclear. In addition, since the style of model M is applied, the face region shows the face of model M rather than that of person P. Correction is therefore needed to improve the result and to preserve P's face in the generated image. Figure 9 shows the editing process. In this paper, the face region (P_face) is extracted from the person image (P) using the segmentation image (P_seg), and the generated image (P') is then corrected by combining P_face with P'.
Figure 10 shows the results of partial style changes. (a) is the input image of a person who wants to change outfits. (b) contains three model images with different styles to try on; the input image is converted to the style of each model image. (c) shows the segmentation images, in which the colored parts (full outfit, upper clothes, pants) mark the regions to be changed. (d) shows the result of changing region (c) of (a) to the style of (b). The resulting images show that only the colored region is changed while the rest of the input image is retained.
Figure 11 shows that varied and unique fashion items can be made by modifying the segmentation image. As an image is generated based on the semantic map, a new result image is obtained whenever the semantic map changes. In this way, clothes of various shapes and lengths can be designed. The length or shape of an outfit can easily be revised, for example from long sleeves to short sleeves, from long pants to short pants, or from an A-line skirt to an H-line skirt. A skirt can be turned into trousers by changing its shape. Furthermore, by switching the label of a region from pants to top, the top style can be applied to the pants region.
SEAN can also be applied to other fashion synthesis problems. For example, from a single image it can create an image of the model in a different pose (Figure 12), or an image from another angle. Additionally, if only an image of the clothes is given, a try-on image of a person wearing them can be generated.
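Two of the operations described in this section, the face correction and the relabeling of a segmentation region, can be sketched with simple mask operations. The sketch assumes that the class indices follow the order of the 20-class list given earlier (so face = 13, top clothes = 5, pants = 9), that P, P_seg, and P' are pixel-aligned arrays of the same size, and that pasting the face pixels is an adequate stand-in for the combination step; the function names are hypothetical.

```python
import numpy as np

# Assumed 0-based indices, following the order of the 20-class list.
TOP_LABEL, PANTS_LABEL, FACE_LABEL = 5, 9, 13


def correct_face(p, p_seg, p_gen, face_label=FACE_LABEL):
    """Paste the original face region of P into the generated image P'.

    p, p_gen: (H, W, 3) images; p_seg: (H, W) integer label map of P.
    """
    face_mask = (p_seg == face_label)[..., None]   # (H, W, 1), broadcasts over RGB
    return np.where(face_mask, p, p_gen)


def relabel(seg, src, dst):
    """Return a copy of the label map with class `src` replaced by `dst`."""
    out = seg.copy()
    out[seg == src] = dst
    return out


# Toy 2 x 2 example: one face pixel, one pants pixel.
p = np.full((2, 2, 3), 10, dtype=np.uint8)        # original person image P
p_gen = np.full((2, 2, 3), 200, dtype=np.uint8)   # generated image P'
p_seg = np.array([[FACE_LABEL, 0], [PANTS_LABEL, 0]])

corrected = correct_face(p, p_seg, p_gen)
edited_seg = relabel(p_seg, PANTS_LABEL, TOP_LABEL)
```

Feeding `edited_seg` instead of `p_seg` to the generator corresponds to applying the top style to the pants region.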

Quantitative Results
Evaluation by human judgment is ambiguous and subjective, and the more images there are to evaluate, the more time it takes. We therefore use the following metrics to quantify the performance of our model and compare our results with SEAN. Five indicators were used: Root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [16], Frechet Inception distance (FID) [17], and Naturalness Image Quality Evaluator (NIQE) [18]. PSNR is the most common criterion for evaluating images, and SSIM is an indicator of the similarity between two images. FID measures the distance between two normal distributions fitted to the Inception features of the real and the generated images. These metrics measure quality relative to reference images. In contrast, NIQE allows quality evaluation without referring to the original image. This method has been used in some applications [19,20]. For SSIM and PSNR, the higher the better. For RMSE, FID, and NIQE, the lower the better.
Some virtual try-on networks have goals similar to ours. The networks in [4,21] let a person try on what others are wearing. Given a product image of a garment, the networks in [3,22] can generate an image of the person wearing it. However, they differ from ours in that they cannot manipulate outfits. To compare with a network that performs the same function, we compared the modified SEAN proposed in this paper with the original SEAN. Each network was trained for 20 epochs. RMSE, PSNR, SSIM, and FID evaluate reconstruction performance: images generated from the segmentation map and the original style were compared with the original images. NIQE was used to evaluate images changed to various styles by each network. Table 1 shows that our model performs better on all metrics.
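The two simplest reference-based metrics, RMSE and PSNR, can be computed as in the sketch below. It assumes 8-bit images with a maximum pixel value of 255 and is not the evaluation code used for Table 1; SSIM, FID, and NIQE need more machinery and are omitted.

```python
import numpy as np


def rmse(reference, generated):
    """Root-mean-square error between two images of the same shape."""
    a = np.asarray(reference, dtype=np.float64)
    b = np.asarray(generated, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 20 * log10(max_value / RMSE)."""
    err = rmse(reference, generated)
    if err == 0.0:
        return float("inf")   # identical images
    return float(20.0 * np.log10(max_value / err))


# Toy example: every pixel differs by 5 gray levels.
ref = np.zeros((4, 4))
out = np.full((4, 4), 5.0)

example_rmse = rmse(ref, out)
example_psnr = psnr(ref, out)
```

The two metrics move in opposite directions: a lower RMSE always gives a higher PSNR, which matches the "lower is better" and "higher is better" conventions stated above.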

Conclusions
In this paper, a modified SEAN was applied to apparel design. The experiments showed that only selected clothing regions of an image can be modified. In addition, by changing the semantic map, clothes can be changed to a new shape, style, or type. User-centered design is possible because users can quickly apply the style they want, so the method is useful for choosing or recommending garments and fashion items. The experiments also showed that it can be applied to various other fashion problems.
Unlike ordinary objects, clothes are difficult to handle with artificial intelligence because their designs and styles vary so widely. As real fashion is much more complex and diverse, some problems remain to be solved before the methods of this paper can be applied directly in the fashion industry. Furthermore, the results are less polished than the work of a designer. However, as an auxiliary tool, the method lets experts easily prototype ideas and stimulates imagination, which will help in design development. In addition, anyone can design, because an idea can be expressed with a simple touch. Continued research is expected to bring further improvement.