Real-Time Hair Segmentation Using Mobile-Unet

: We described a real-time hair segmentation method based on a fully convolutional network with the basic structure of an encoder–decoder. In one of the traditional computer vision techniques for hair segmentation, the mean shift and watershed methodologies suffer from inaccuracy and slow execution due to multi-step, complex image processing. It is also difﬁcult to execute the process in real-time unless an optimization technique is applied to the partition. To solve this problem, we exploited Mobile-Unet using the U-Net segmentation model, which incorporates the optimization techniques of MobileNetV2. In experiments, hair segmentation accuracy was evaluated by different genders and races, and the average accuracy was 89.9%. By comparing the accuracy and execution speed of our model with those of other models in related studies, we conﬁrmed that the proposed model achieved the same or better performance. As such, the results of hair segmentation can obtain hair information (style, color, length), which has a signiﬁcant impact on human-robot interaction with people.


Introduction
A person's hair information is important in assessing their appearance. Hair varies widely in color, length, pattern, and texture information depending on gender, age, fashion, culture, and personal taste. Appearance information, whether positive or negative, acts as an important factor in human-to-human interactions as well as in human interactions with robots [1]. For robots to have social relationships with people they need to mimic the way people have social relationships [2]. However, a strand of hair is very flexible and thin, which can be transformed into a variety of shapes, similar to skin color, or heavily influenced by external lighting, making it difficult to segment skin, hair, and background areas. Additionally, hair segmentation information can not only complement social robot applications and face recognition results, but it can also be used for a variety of applications, such as make-up changes and character photo editing [3]. Thus, a variety of studies have been conducted recently on hair partitioning and recognition of style information [3][4][5][6], but the study is challenging because hair areas vary widely depending on complex backgrounds, different poses, reflected light, race, hair color, and dyeing.
Earlier hair split and color automatic recognition were proposed by Yacoob [4], and hair areas were extracted using the ratio of face and color information and area grinding methods for the front of the face. Wang [5] suggested a pre-learned hair segmentation classifier to automatically expand into the entire area if the hair seeded area was manually specified. However, this method has the weaknesses of having to manually designate hair areas and of having low division accuracy if the background and hair color are similar. Wang [6] proposed the data-driven isomeric manifold inference method. This method manually enters the hair area from the labeled hair segmentation training image and produces a probability map that enables the generation of the seed area. This method also

Pre-Processing Steps
To achieve DNN-based hair segmentation, several pre-processing methods are required. Here, we describe landmark detection, size normalization, and data augmentation of face images, as well as hair recoloring of a detected hair region.

Landmark Detection and Size Normalization
As an initial experiment, the size of the input image was simply normalized to a size of 224 × 224, which was used as training data without other pre-processing. In this case, the real-time test did not obtain good results according to changes in the environment, such as distance or lighting. To solve this problem, we sought a method to normalize images regardless of size, location, or distance by immobilizing them in a specific reference position rather than simply resizing them and configuring the dataset. Here, we used 68 face landmarks detectors [13] to normalize the input images. The face detector used a single-shot multi-box detector [14] that had relatively fast detection and was highly accurate. We normalized the input image to the central position of the image, as shown in Figure 1, after the detection of landmarks.
Electronics 2021, 10, x FOR PEER REVIEW 3 of 13 such as distance or lighting. To solve this problem, we sought a method to normalize images regardless of size, location, or distance by immobilizing them in a specific reference position rather than simply resizing them and configuring the dataset. Here, we used 68 face landmarks detectors [13] to normalize the input images. The face detector used a single-shot multi-box detector [14] that had relatively fast detection and was highly accurate. We normalized the input image to the central position of the image, as shown in Figure 1, after the detection of landmarks. As shown in Figure 1, we normalized images of any size, by converting them to a size of 224 × 224, so that the nose was at the central position of the image. If the converted area was larger than the original, we set the pixels in the padded background area to 255. The reason the border of the image was not padded to zero was mainly that the border of normal images, or real-time images, was bright instead of dark.

Data Augmentation
If a video is taken in real-time, the input image can be characterized according to changes in distance, pose, camera angle, and light intensity. For training in DNN models, if sufficient training datasets that are the same as a variable input environment cannot be collected, it is common to increase the training data by artificially generating such environmental changes. Consequently, a total of 95,200 items of training data were obtained from 6800 items of training data by using flip and rotation (±5, ±15, ±30) to acquire data, which enabled 14 additional images to be obtained from one image. Figure 2 shows the results of the data augmentation. As shown in Figure 1, we normalized images of any size, by converting them to a size of 224 × 224, so that the nose was at the central position of the image. If the converted area was larger than the original, we set the pixels in the padded background area to 255. The reason the border of the image was not padded to zero was mainly that the border of normal images, or real-time images, was bright instead of dark.

Data Augmentation
If a video is taken in real-time, the input image can be characterized according to changes in distance, pose, camera angle, and light intensity. For training in DNN models, if sufficient training datasets that are the same as a variable input environment cannot be collected, it is common to increase the training data by artificially generating such environmental changes. Consequently, a total of 95,200 items of training data were obtained from 6800 items of training data by using flip and rotation (±5, ±15, ±30) to acquire data, which enabled 14 additional images to be obtained from one image. Figure 2 shows the results of the data augmentation.

Basic Hair Segmentation Model
For hair segmentation, the use of a segmentation model with the basic structure of an encoder-decoder [15] is more appropriate than AlexNet or VGGNet, which are composed of CNN and ResNet architectures with general pipeline structures for detection or classification purposes. Object segmentation, for which location information-such as hair segmentation-is important, requires spatial information for images entering the input layer of the model to be matched with spatial information for the resulting image going out to the output layer. To satisfy this process, the segmentation model takes the form of down-sampling and up-sampling steps, as shown in Figure 3.

Basic Hair Segmentation Model
For hair segmentation, the use of a segmentation model with the basic structure of an encoder-decoder [15] is more appropriate than AlexNet or VGGNet, which are composed of CNN and ResNet architectures with general pipeline structures for detection or classification purposes. Object segmentation, for which location information-such as hair segmentation-is important, requires spatial information for images entering the input layer of the model to be matched with spatial information for the resulting image going out to the output layer. To satisfy this process, the segmentation model takes the form of down-sampling and up-sampling steps, as shown in Figure 3. A model that initiates the image segmentation and introduces an encoder-decoderstyle restoration structure as the core of the model is the FCN model for semantic segmentation [8]. However, the FCN does not fully recover the segmentation because it uses an asymmetric method that goes from a very small channel to a large channel instead of a format in which dimensions are recovered at a certain rate during up-sampling. The model that addresses the asymmetry of the skip-connection and up-sampling is the U-Net model. Therefore, we used the U-Net model as a basis for the hair segmentation model.

Basic Hair Segmentation Model
For hair segmentation, the use of a segmentation model with the basic structure of an encoder-decoder [15] is more appropriate than AlexNet or VGGNet, which are composed of CNN and ResNet architectures with general pipeline structures for detection or classification purposes. Object segmentation, for which location information-such as hair segmentation-is important, requires spatial information for images entering the input layer of the model to be matched with spatial information for the resulting image going out to the output layer. To satisfy this process, the segmentation model takes the form of down-sampling and up-sampling steps, as shown in Figure 3. A model that initiates the image segmentation and introduces an encoder-decoderstyle restoration structure as the core of the model is the FCN model for semantic segmentation [8]. However, the FCN does not fully recover the segmentation because it uses an asymmetric method that goes from a very small channel to a large channel instead of a format in which dimensions are recovered at a certain rate during up-sampling. The model that addresses the asymmetry of the skip-connection and up-sampling is the U-Net model. Therefore, we used the U-Net model as a basis for the hair segmentation model. A model that initiates the image segmentation and introduces an encoder-decoderstyle restoration structure as the core of the model is the FCN model for semantic segmentation [8]. However, the FCN does not fully recover the segmentation because it uses an asymmetric method that goes from a very small channel to a large channel instead of a format in which dimensions are recovered at a certain rate during up-sampling. The model that addresses the asymmetry of the skip-connection and up-sampling is the U-Net model. Therefore, we used the U-Net model as a basis for the hair segmentation model.

Hair Recoloring Processing
When the hair is correctly segmented, it is possible to recolor the hair various colors that differ from the input hair color. Hair recoloring has the advantage of allowing users to experience what colors would match when they change their hair color. To automatically convert hair colors, we used a method to convert input images into intensity images using color maps to select desired colors and to match them in the lookup table (LUT) from brightness values [16]. Figure 4 shows an example of the hair recoloring process.

Hair Recoloring Processing
When the hair is correctly segmented, it is possible to recolor the hair various colors that differ from the input hair color. Hair recoloring has the advantage of allowing users to experience what colors would match when they change their hair color. To automatically convert hair colors, we used a method to convert input images into intensity images using color maps to select desired colors and to match them in the lookup table (LUT) from brightness values [16]. Figure 4 shows an example of the hair recoloring process.

Proposed Method
In human-robot interaction, where real-time interaction is essential, the hair segmentation model using the traditional U-Net model has difficulty in real-time processing. To solve this, we propose the use of the two optimization algorithms (depth-wise separable convolution and inverted residual block) used by MobileNetV2 [12] in the traditional U-Net.

Depth-Wise Separable Convolution
The depth-wise separable convolution operation is an optimization technique that reduces the computational capacity by performing 1 × 1 component operations on each channel instead of the normal convolution operation. Figure 5a presents general convolution and Figure 5b presents depth-wise separable convolution. The U-Net in real-time processing must reduce the convolution operations that require the most operations. Depth-wise separable convolution can be seen as a factorization of this common convolution.
In Equation (1), DK represents the kernel size, DF represents the size of the input image, M represents the input channel, and N represents the output channel. Cost represents the number of computations.

Proposed Method
In human-robot interaction, where real-time interaction is essential, the hair segmentation model using the traditional U-Net model has difficulty in real-time processing. To solve this, we propose the use of the two optimization algorithms (depth-wise separable convolution and inverted residual block) used by MobileNetV2 [12] in the traditional U-Net.

Depth-Wise Separable Convolution
The depth-wise separable convolution operation is an optimization technique that reduces the computational capacity by performing 1 × 1 component operations on each channel instead of the normal convolution operation. Figure 5a presents general convolution and Figure 5b presents depth-wise separable convolution. The U-Net in real-time processing must reduce the convolution operations that require the most operations. Depthwise separable convolution can be seen as a factorization of this common convolution.
When the hair is correctly segmented, it is possible to recolor the hair various colors that differ from the input hair color. Hair recoloring has the advantage of allowing users to experience what colors would match when they change their hair color. To automatically convert hair colors, we used a method to convert input images into intensity images using color maps to select desired colors and to match them in the lookup table (LUT) from brightness values [16]. Figure 4 shows an example of the hair recoloring process.

Proposed Method
In human-robot interaction, where real-time interaction is essential, the hair segmentation model using the traditional U-Net model has difficulty in real-time processing. To solve this, we propose the use of the two optimization algorithms (depth-wise separable convolution and inverted residual block) used by MobileNetV2 [12] in the traditional U-Net.

Depth-Wise Separable Convolution
The depth-wise separable convolution operation is an optimization technique that reduces the computational capacity by performing 1 × 1 component operations on each channel instead of the normal convolution operation. Figure 5a presents general convolution and Figure 5b presents depth-wise separable convolution. The U-Net in real-time processing must reduce the convolution operations that require the most operations. Depth-wise separable convolution can be seen as a factorization of this common convolution.
In Equation (1), DK represents the kernel size, DF represents the size of the input image, M represents the input channel, and N represents the output channel. Cost represents the number of computations. In Equation (1), D K represents the kernel size, D F represents the size of the input image, M represents the input channel, and N represents the output channel. Cost represents the number of computations.
Equation (2) shows the total computational capacity of the depth-wise separable convolution operation. The difference between the arithmetic of (1) and (2) is shown in (3).
Electronics 2021, 10, 99 6 of 12 The Diff_Ratio indicates the difference in the ratio between Equations (1) and (2). This is about 8 to 9 times the difference, proving that the depth-wise separation operation is much faster than the normal convolution operation.

Inverted Residual Block
The inverted residual block structure is a method that reduces the amount of calculation by replacing a common structure that is connected between feature maps with a structure with large channels between 1 × 1 component bottlenecks. Residual blocks indicate the beginning and end of a convolutional block with a skip connection. By adding these two states, the network has the ability to access earlier activations that were not modified in the convolutional block. This method was revealed to be essential to building large depth networks. When looking more closely at the skip connection, it becomes clear that an original residual block follows a wide to narrow to wide approach, concerning the number of channels, if the input channel has a large number of channels that are compressed with an inexpensive 1 × 1 convolution. In that case, the following 3 × 3 convolution has far fewer parameters. To add input and output at the end, the number of channels is increased again using another 1 × 1 convolution.
On the other hand, an inverted residual block follows a narrow to wide to narrow approach. The first step widens the network using a 1 × 1 convolution because the following 3 × 3 depth-wise convolution has already greatly reduced the number of parameters. Afterward, another 1 × 1 convolution "squeezes" the network in order to match the initial number of channels. The inverted residual block method is very effective in terms of efficiency when we use GPUs. A GPU has internal and external memories, so the size of the lift and the size of the drop are the most important. Considering the memory swap, the low number of channels in the input layer and the last output layer indicates that it is efficient. This is an important attribute for mobile robots with limited memory. Figure 6 shows the configuration of the inverted residual block.
Equation (2) shows the total computational capacity of the depth-wise separable convolution operation. The difference between the arithmetic of (1) and (2) is shown in (3).
The Diff_Ratio indicates the difference in the ratio between Equations (1) and (2). This is about 8 to 9 times the difference, proving that the depth-wise separation operation is much faster than the normal convolution operation.

Inverted Residual Block
The inverted residual block structure is a method that reduces the amount of calculation by replacing a common structure that is connected between feature maps with a structure with large channels between 1 × 1 component bottlenecks. Residual blocks indicate the beginning and end of a convolutional block with a skip connection. By adding these two states, the network has the ability to access earlier activations that were not modified in the convolutional block. This method was revealed to be essential to building large depth networks. When looking more closely at the skip connection, it becomes clear that an original residual block follows a wide to narrow to wide approach, concerning the number of channels, if the input channel has a large number of channels that are compressed with an inexpensive 1 × 1 convolution. In that case, the following 3 × 3 convolution has far fewer parameters. To add input and output at the end, the number of channels is increased again using another 1 × 1 convolution.
On the other hand, an inverted residual block follows a narrow to wide to narrow approach. The first step widens the network using a 1 × 1 convolution because the following 3 × 3 depth-wise convolution has already greatly reduced the number of parameters. Afterward, another 1 × 1 convolution "squeezes" the network in order to match the initial number of channels. The inverted residual block method is very effective in terms of efficiency when we use GPUs. A GPU has internal and external memories, so the size of the lift and the size of the drop are the most important. Considering the memory swap, the low number of channels in the input layer and the last output layer indicates that it is efficient. This is an important attribute for mobile robots with limited memory. Figure 6 shows the configuration of the inverted residual block.

Proposed Mobile-Unet
U-Net combined with the optimization techniques described in the previous section is called Mobile-Unet. Figure 7 shows the brief architecture of the proposed Mobile-Unet model.

Proposed Mobile-Unet
U-Net combined with the optimization techniques described in the previous section is called Mobile-Unet. Figure 7 shows the brief architecture of the proposed Mobile-Unet model.
The input size started at 224 × 224. The image size was reduced by each step to an image size of 7 × 7 after a total of five steps of contraction. In the next stage, we went through an expansion phase of the five steps symmetrically. Because the expansion phase is a process of restoring information, the layer structure was built to be the same as that of the contraction phase. Table 1 shows the Mobile-Unet contraction process, where T is the channel extension factor and C is the final number of channels after the last convolution operation. N indicates the number of times the operation is repeated for the inverted bottleneck and S indicates the stride. The best empirical result was yielded when the expansion factor T was set to 6. The input size started at 224 × 224. The image size was reduced by each step to an image size of 7 × 7 after a total of five steps of contraction. In the next stage, we went through an expansion phase of the five steps symmetrically. Because the expansion phase is a process of restoring information, the layer structure was built to be the same as that of the contraction phase. Table 1 shows the Mobile-Unet contraction process, where T is the channel extension factor and C is the final number of channels after the last convolution operation. N indicates the number of times the operation is repeated for the inverted bottleneck and S indicates the stride. The best empirical result was yielded when the expansion factor T was set to 6.  Table 2 shows the process of the Mobile-Unet expansion in detail. The number of repetitions was set to 1, and the image size was doubled for each inverted residual operation. We set the expansion ratio to 6. The final step calculates the score from 224 × 224 × 16 to 224 × 224 × 3 and then to 224 × 224 × 1.   Table 2 shows the process of the Mobile-Unet expansion in detail. The number of repetitions was set to 1, and the image size was doubled for each inverted residual operation. We set the expansion ratio to 6. The final step calculates the score from 224 × 224 × 16 to 224 × 224 × 3 and then to 224 × 224 × 1.

Experimental Results
Here, the real-time hair segmentation using the Mobile-Unet method proposed in this study was evaluated. In our experiments, the hair segmentation datasets, including ETRIHair data, the experimental environment and results, and performance comparison are described.

Hair Segmentation Datasets
Public hair segmentation datasets that include hair-partitioned masks are relatively small, which makes it difficult to collect enough data to train with DNNs. The effort and time required to build hair segmentation datasets are greater than when building another training dataset for the face or human body because it is particularly difficult to divide thin hair areas into different hair segmentation areas to create training target masks. The currently released hair segmentation datasets [17,18] employ some of the existing large face datasets and manually partition hair partitions to build the training dataset. Table 3 shows the sets of data we used for training to develop the proposed algorithm. The first dataset was collected by extracting some data from the Labeled Faces in the Wild (LFW) dataset and labeling a hair region. This dataset, called the Parts Labels dataset [17], has a total of three labeled areas-hair, background, and face-and consists of about 2900 images that are not pre-processed and are publicly available. The pixel value of the hair areas in Parts Labels is set to 255 and the pixels of the other mask image for the rest of the areas are set to 0 to handle the binary division problem. The second dataset, called the CelebBHair dataset [18], was collected by applying a hair mask to some images selected from the CelebA face datasets. This dataset consists of about 3500 images and has the same background, face, and hair area divisions as Parts Labels. The same pre-processing used on the Part Labels datasets was also carried out on these images for use in this study.
The third ETRIHair dataset collected about 1200 face data of Asian people and the hair segmentation was marked manually. The reason for also collecting Asian hair data is that the majority of the two previously released datasets consists of European hair data. After the initial training, we tested the hair segmentation DNN model using only the first two datasets. The accuracy for Asians was measured to be about 30% lower than that for Europeans based on an interaction of union (IOU) evaluation. If the dataset is developed using the proposed pre-processing method on these two datasets, the data will consist of approximately 6000 images. Many data items were excluded as there was a large number of noise images, which are those that do not include the hair region or occluded the hair region or images with faces that could not be detected by the face detector [14], such as those where most of the face was covered by hair or the back of the head. To address this issue, the data could have been corrected using a hair segmentation model that was already developed [3,9], but there was no guarantee that the results would be consistently well-segmented in various environments. Therefore, the hair segmentation was carried out using manual image segmentation software. Despite these tasks, the final set of data used for training the DNN model, excluding non-calibrating data, consisted of 6800 images with 1000 images for verification.

Experimental Environment and Results
To develop the DNN model, we used the PyTorch deep learning framework in Python. The GPUs we used for training employed parallel training with two NVidia GTX TITAN-Xs.
The input images went through the normalization process after switching to Tensor, the batch size was set to 128, and for the loss, we used BCELoss. We used the Stochastic Gradient Descent (SGD) optimizer (learning rate = 0.001, weight decay = 1 × 10 −8 , momentum = 0.9 and nesterov = "True"). Of the 95,200 images obtained for hair segmentation, we randomly selected 90% for training and 10% for testing. Table 4 shows the processing time for each module. As shown in Table 4, the proposed Mobile-Unet module had almost the same speed as face detection, indicating that real-time responses are possible. Our hardware specification for the inference test is a notebook with an Intel I7-8750H CPU, 16 G of memory, and a GTX1060 graphics card. The accuracy of hair segmentation can be evaluated by various criteria. Here, we made measurements based on the Dice coefficient and IOU evaluation criteria. Table 5 shows the results of racial hair segmentation. As described in Section 4.1, our study obtained segmentation results that were similar to those for Europeans, as we learned by adding an Asian database to the evaluation. However, Asian hair is still less accurate in segmentation than European hair. The reason for this is that additional Asian training data from ETRIHair were used, but the overall training data rate was smaller than that of Europeans. Moreover, Asians have a lot of short hairstyles for both men and women, because the area of hair is relatively small compared to the entire face, so even untrained a slightly different hairstyle increased the IOU error.  Table 6 shows that the accuracy for men's hair is generally high, at 93.8% on average, because men's hair is relatively less diverse. In contrast, women's hair has a relatively wide variety of hairstyles. Additionally, women's hair has a wide variety of hair colors, and if similar hair colors do not exist in the training dataset hair segmentation does not work well. The difference in accuracy between men and women based on the IOU criterion was 11.6%.  Figure 8 compares the performance of the recently proposed hair segmentation methods and the proposed methods. We used the IOU performance criterion. In most cases, the results show good performance based on the CELEbA test dataset. Compared with Unet [15], the accuracy is similar to our method, but the inference time is over ten times slower. The highest method in terms of accuracy was DeepLab V3 [19], which achieved an accuracy of 91.2% based on the CELEbA test dataset, but they had two stages (hair segmentation and a hair detection stage). If they do not use a hair detection stage, the accuracy rate is 89.9%, which is the same score as our proposed method. DeepLab V3's execution time was 55 ms, so it was not fast when compared with our models in Figure 8. The best method in terms of execution time was the Tkachenka [20] method, which had the fastest real-time speed of 6 ms. However, the accuracy was only 80.2%; the trade-off between speed and accuracy was obvious. The Mobile-Unet proposed in this study achieved an accuracy of 89.9% on the CELEbA test dataset. The execution time was 13 ms, which is the second-fastest among the methods shown on the graph. Thus, considering the proposed Mobile-Unet method's accuracy and execution time together, we can see that it is more efficient than the other methods.

Average
93.8% 82.1% Figure 8 compares the performance of the recently proposed hair segmentation methods and the proposed methods. We used the IOU performance criterion. In most cases, the results show good performance based on the CELEbA test dataset. Compared with Unet [15], the accuracy is similar to our method, but the inference time is over ten times slower. The highest method in terms of accuracy was DeepLab V3 [19], which achieved an accuracy of 91.2% based on the CELEbA test dataset, but they had two stages (hair segmentation and a hair detection stage). If they do not use a hair detection stage, the accuracy rate is 89.9%, which is the same score as our proposed method. DeepLab V3's execution time was 55 ms, so it was not fast when compared with our models in Figure 8. The best method in terms of execution time was the Tkachenka [20] method, which had the fastest real-time speed of 6 ms. However, the accuracy was only 80.2%; the trade-off between speed and accuracy was obvious. The Mobile-Unet proposed in this study achieved an accuracy of 89.9% on the CELEbA test dataset. The execution time was 13 ms, which is the second-fastest among the methods shown on the graph. Thus, considering the proposed Mobile-Unet method's accuracy and execution time together, we can see that it is more efficient than the other methods.  Table 7 shows the comparison of the computational and memory complexity between DeepLab V3, the original Unet, and the proposed method. The proposed method measures approximately 4.2 times faster than DeepLab V3 in inference time and the number of parameters that determine memory complexity use approximately 4.2 percent only that of DeepLab V3.  Table 7 shows the comparison of the computational and memory complexity between DeepLab V3, the original Unet, and the proposed method. The proposed method measures approximately 4.2 times faster than DeepLab V3 in inference time and the number of parameters that determine memory complexity use approximately 4.2 percent only that of DeepLab V3.  Figure 9 visually depicts the results of hair segmentation on test datasets obtained using the proposed method. For baldness, other algorithms often show the wrong results, but in our method, the results are output without hair segmentation results.  Figure 10 shows the poor results obtained by our method, which occurred because of noise regions such as a hand, hairbrush, or bounded input images.

Conclusions
We have described a real-time hair segmentation method based on the Mobile-Unet model with optimization techniques of depth-wise separable convolution and inverted residual block technology. Optimization is an important task for real-time processing in mobile devices and embedded edge computing. In general, increasing the number of layers in deep neural networks can improve recognition performance, but it also increases processing time. Therefore, a balance between performance and speed is required. The proposed algorithm has balance in terms of performance and speed. In experiments, the results produced an average hair segmentation accuracy of 89.9% with a 32 ms processing time for different genders and races. By comparing the accuracy and execution speed with other models in related studies, we confirmed that the proposed model achieved equal    Figure 9 visually depicts the results of hair segmentation on test datasets obtained using the proposed method. For baldness, other algorithms often show the wrong results, but in our method, the results are output without hair segmentation results.  Figure 10 shows the poor results obtained by our method, which occurred because of noise regions such as a hand, hairbrush, or bounded input images.

Conclusions
We have described a real-time hair segmentation method based on the Mobile-Unet model with optimization techniques of depth-wise separable convolution and inverted residual block technology. Optimization is an important task for real-time processing in mobile devices and embedded edge computing. In general, increasing the number of layers in deep neural networks can improve recognition performance, but it also increases processing time. Therefore, a balance between performance and speed is required. The proposed algorithm has balance in terms of performance and speed. In experiments, the results produced an average hair segmentation accuracy of 89.9% with a 32 ms processing time for different genders and races. By comparing the accuracy and execution speed with other models in related studies, we confirmed that the proposed model achieved equal

Conclusions
We have described a real-time hair segmentation method based on the Mobile-Unet model with optimization techniques of depth-wise separable convolution and inverted residual block technology. Optimization is an important task for real-time processing in mobile devices and embedded edge computing. In general, increasing the number of layers in deep neural networks can improve recognition performance, but it also increases processing time. Therefore, a balance between performance and speed is required. The proposed algorithm has balance in terms of performance and speed. In experiments, the results produced an average hair segmentation accuracy of 89.9% with a 32 ms processing time for different genders and races. By comparing the accuracy and execution speed with other models in related studies, we confirmed that the proposed model achieved equal performance and better speed. Additionally, the proposed method showed similar performance on Asian hair segmentation to European hair segmentation using the proposed ETRIHair dataset and augmentation approach. In addition, there was a performance rate decrease of more than 20% in the experiment if the proposed Asian datasets were not used. Furthermore, we may study new models for more accurate and detailed hair segmentation by adding new optimization algorithm such as knowledge distillation.  Data Availability Statement: Data available on request due to privacy. The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.