Deep-Learning-Based Scalp Image Analysis Using Limited Data

Abstract: The World Health Organization and Korea National Health Insurance report that the number of alopecia patients is increasing every year, and approximately 70 percent of adults suffer from scalp problems. Although alopecia is a genetic problem, it is difficult to diagnose at an early stage. Deep-learning-based approaches have been effective for medical image analyses, but it is challenging to generate deep learning models for alopecia detection and analysis because creating an alopecia image dataset is difficult. In this paper, we present an approach for generating a model specialized for alopecia analysis that achieves high accuracy by applying data preprocessing, data augmentation, and an ensemble of deep learning models that have been effective for medical image analyses. We use an alopecia image dataset containing 526 good, 13,156 mild, 3742 moderate, and 825 severe alopecia images. The dataset was further augmented by applying normalization, geometry-based augmentation (rotate, vertical flip, horizontal flip, crop, and affine transformation), and PCA augmentation. We compare the performance of single deep learning models (ResNet, ResNeXt, DenseNet, and XceptionNet) and ensembles of these models. The best result was achieved when DenseNet, XceptionNet, and ResNet were combined, with an accuracy of 95.75% and an F1 score of 87.05%.


Introduction
According to Korea National Health Insurance [1] in 2021, the number of alopecia patients in Korea increased from 103,000 in 2001 to 145,000 in 2005 and then surged to 233,000 in 2020. As these figures do not reflect patients whose alopecia is caused by genetic factors and aging, the domestic alopecia population is estimated to be approximately 10 million as of 2021. Early treatment is known to be effective for alopecia because the symptoms worsen over time [2]. Therefore, it is essential to detect alopecia early. Currently, to diagnose and prevent alopecia, people need to personally visit a specialized scalp clinic. Owing to the increase in the population of alopecia sufferers and the consequent interest in preventing alopecia, personal consultations with experts are likely to involve a considerable delay. People can also feel burdened just by visiting a clinic because of the time and money it costs. Accordingly, there is a need for a convenient and accessible solution to diagnose and analyze alopecia. Therefore, systems that allow users to easily detect alopecia at home are being actively studied. Several systems have attempted to diagnose alopecia by extracting alopecia characteristics [3], by analyzing the thickness and density of the hair [4], and by using microscope images captured with a detachable portable camera for smartphones. Most of these systems use general image data processing approaches rather than specific data processing customized for analyzing alopecia.
Our contributions are as follows: (1) we created an alopecia-specialized model that can achieve high accuracy even with a limited dataset; (2) we present several data augmentation approaches that are appropriate for alopecia images; (3) we evaluate the performance of ResNet, ResNeXt, DenseNet, and XceptionNet, which have been effective for medical data analyses, and then ensemble the models to create alopecia-specialized models. The accuracy was highest when ensembling DenseNet, XceptionNet, and ResNet, specifically, 95.75% (with an F1 score of 87.05%).
The rest of this paper is organized as follows: Section 2 reviews existing research on scalp data preprocessing methods and existing models for scalp datasets. Section 3 presents our data preprocessing method, our data augmentation method, and the models that we used for alopecia condition classification. Section 4 includes a description of the model architecture and the results of this research. Section 5 concludes this paper.

Literature Review

Scalp Data Preprocessing Methods
Kim et al. [3] implemented a method to measure hair density, which is the most basic feature used to diagnose alopecia, through image processing on scalp hair microscope image datasets. As previous research has diagnosed alopecia based on the extent of hair growth, it is necessary to cut the hair. The method by Kim et al. [3] is meaningful in that it preprocesses images with contrast stretching and morphology processing, converts them into skeleton images, and applies a search algorithm to identify hair endpoints in order to measure density. However, as this approach focuses only on the density of hair, it ignores other characteristics relevant to early detection, such as dead skin cells and erythema. Kim et al. [4] implemented an alopecia diagnosis system using hair density, thickness, number of hair follicles, and redness, which are indicators that can determine alopecia. Similar to Kim et al. [3], this work uses scalp hair microscope image datasets and a similar preprocessing method. However, thickness is detected by applying Canny edge detection and determining the average distance between two points. The RGB values, especially the R values, were compared to determine redness. Although the work considers more diverse alopecia indicators than Kim et al. [3], it does not conduct preprocessing based on the scalp characteristics of each person, and the likelihood of alopecia cannot be expressed as a percentage.
ESENSEI data mining [5] reduces and normalizes the images to lower the difference in hue between images when predicting the location of hair follicles in scalp images. A single image can be used to create eight images by using x- and y-axis symmetry. As it only uses symmetry, it has the advantage that interpolation is not necessary. However, the degree of enhancement is low.
Trichoscopy of alopecia areata [6] is an algorithm for diagnosing alopecia by extracting hair loss features (HLF) from images captured via a microscope. In Seo et al. [6], image datasets are converted to gray tones to reduce errors arising from color differences. Contrast stretching is used to recover the areas covered by the shadow of the microscope and by light reflected from the scalp. Techniques such as contrast stretching reduce noise in images. However, the disadvantage of this approach is that it is weak against color changes in hair because it processes images only in gray tones. HLF consists of a hair count, a thickness estimation, and a follicle count, and each HLF is trained in Seo et al. [6]. The approach is suitable for measuring the features separately for each cause of alopecia, but it is difficult to use it to consider scalp diseases other than the hair condition.
Reference [7] flipped the photos vertically and rotated them between −15° and +15° as preprocessing. As a result, mAP (50:95) increased by about 5 to 20 in EfficientDet and DetectoRS, but by only about 1 to 2 in YOLOv4, so the performance improvement there was small. This is due to the mosaic augmentation already used by YOLOv4. Mosaic augmentation synthesizes four training images into one so that the model can learn to detect small objects.
Table 1 summarizes the existing scalp data preprocessing works.

Reference | Description | Problems
[3] | Diagnosed alopecia based on the extent of hair growth. Preprocessed using a feature extraction algorithm. | Focused only on the density of hair.
[4] | Used various indicators that can determine alopecia, such as hair density, thickness, and so on. | There was no preprocessing to reflect the characteristics of each person.
[5] | Normalized the images to lower the difference in hue between the scalp images. | The degree of enhancement is low.
[6] | Converted to gray tones to reduce errors arising from color differences and created an algorithm for diagnosing alopecia by extracting HLF from scalp images. | Difficult to use it to consider scalp diseases, except for the hair condition.
[7] | Flipped images vertically and rotated them between −15° and +15°. | Small performance improvement on models that already include data augmentation techniques, such as YOLOv4.

Existing Models for Scalp Datasets
Kim et al. [7] implemented an automated measurement of hair density using deep neural networks. In [7], the model simply learns hair follicle images and detects hair follicles within a scalp image. Thereafter, hair loss is determined from the number of hair follicles appearing in the image. Reference [8] describes an automatic trichoscopic image analysis model. This model consists of D-Net for trichoscopy image detection and R-Net for prediction. If a trichoscopic image is used as the input of D-Net, then the hair follicle is detected. When this process is finished, the R-Net calculates the number of hairs as well as the proportion of hairs of different types. The methods in [7,8] show low accuracy if the race of the subject, the shape of the hair follicle, or the distance at which the photograph was taken changes. In fact, the maximum accuracy of [7] is 75.73%.
ScalpEye [9] is an intelligent scalp examination and diagnosis system based on deep learning for scalp physical therapy. Images of hair under a microscope are used as a training dataset. The system uses Faster R-CNN with the Inception ResNet_v2_Atrous model for examining scalp hair symptoms. The cost and time of educating and training scalp physical therapists can be reduced. However, there is a limitation in that it does not directly serve the user.
Benhabiles et al. [10] designed a system that uses facial images as a dataset to detect alopecia and classify it into seven levels. To reduce overfitting, this approach applies horizontal reflection, Gaussian noise, Gaussian blur, and contrast-limited adaptive histogram equalization for data enhancement. As Benhabiles et al. [10] only use methods for general image enhancement, not scalp-specific enhancement processes, the approach is inadequate for scalp-specialized models.
Shakeel et al. [11] proposed a framework to classify healthy hairs and alopecia areas using a support vector machine (SVM) and k-nearest neighbors (K-NN). It uses only 200 healthy hair images and 68 alopecia area images. In order to overcome the disadvantage of the small amount of data, Shakeel et al. [11] use image preprocessing and enhancement through histogram equalization (HE). Three features, i.e., hair color, hair texture, and hair shape, are extracted. Although Shakeel et al. [11] is meaningful in that each hair feature is extracted and trained, it suffers from a high possibility of overfitting because only HE is used for data enhancement.
Ref. [12] used a CNN for image classification by automatically extracting features from raw pixel data. The model included ReLU activation, pooling layers to reduce feature map dimensions, and dropout layers to prevent overfitting. A 0.3 dropout rate resulted in 30% of the neurons being dropped randomly in each epoch. However, this model can only judge whether the image is alopecia or non-alopecia, so it is difficult to diagnose the progress of one's alopecia.
Reference [13] compares the performance of various machine learning algorithms, including SVM, KNN, the Random Forest classifier, Gaussian Naive Bayes, and CNN, to accurately classify alopecia symptoms. Overall, [13] aims to improve the accuracy of dermatological alopecia diagnosis using machine learning techniques. The CNN algorithm showed the highest accuracy at 92%. Reference [14] classified scalp lesion images by adding a convolutional block attention module (CBAM) and spinal FC to the classic DenseNet model. In addition, by combining cloud computing and an AIoT design architecture, the model can be used in more general situations. As a result, an accuracy of 85.03% was obtained. However, the accuracies in [13,14] are insufficient to diagnose scalp diseases such as alopecia and scalp lesions.
Reference [15] generated additional data using data augmentation (DA), and features were extracted using a VGG-19 pretrained CNN model. The proposed VGG-SVM was shown to be 98.31% accurate in simulations using 200 HH images from the Figaro1k dataset and 68 AA images from the Dermnet dataset. However, the work has a limitation in that it did not take into account racial differences.
Table 2 summarizes the existing models for scalp data.
Table 2. Summary of the existing models for scalp data.

Reference | Description | Problems
[7] | A model that simply learns hair follicle images and detects hair follicles within a scalp image. | If the race, the shape of the hair follicle, and the location or distance of the picture are different, the accuracy is low.
[8] | An automatic hair follicle image analysis model which consists of D-Net for trichoscopy image detection, and R-Net for prediction. | If the race, the shape of the hair follicle, and the location or distance of the picture are different, the accuracy is low.
[9] | A system that uses Faster R-CNN with the Inception ResNet_v2_Atrous model for examining scalp hair symptoms. | There is a limitation in that it does not directly serve the user.
[10] | A system that uses facial images as a dataset to detect alopecia and classify it into seven levels. | It only uses methods for general image enhancement, and not scalp-specific enhancement processes.
[11] | A framework that consists of a support vector machine and k-nearest neighbors. | It suffers from a high possibility of overfitting because only HE is used for data enhancement.
[12] | An image classification model, which consists of a CNN structure, that extracts the characteristics of alopecia automatically. | It can only judge if the image is alopecia or non-alopecia, so it is difficult to diagnose the progress of one's alopecia.
[13] | They preprocessed the dataset using image enhancement, segmentation, and data augmentation techniques. They compared the performance of various machine learning algorithms. | The accuracy is insufficient to diagnose scalp diseases, such as alopecia and scalp lesions.
[14] | A scalp lesion image classifier that combines cloud computing and AIoT design architecture with an algorithm that adds a convolutional block attention module (CBAM) and spinal FC to the classic DenseNet model. | The accuracy is insufficient to diagnose scalp diseases, such as alopecia and scalp lesions.
[15] | They proposed VGG-SVM for alopecia diagnosis, and their algorithm showed the highest accuracy at 98.31%. | They did not take into account racial differences.

Dataset
In this study, an open dataset from AI Hub [16] was used, which was reviewed by three Seoul National University Hospital specialists who set the classification criteria. Data augmentation was performed to increase the amount of data. Scalp images taken from four sides, i.e., the top, left, right, and back of the head, were included, and the training was conducted using these scalp images. In the original dataset, four distinct alopecia conditions were annotated: 526 good (0), 13,156 mild (1), 3742 moderate (2), and 825 severe (3). The numbers 0, 1, 2, and 3 are used to denote each of the conditions. The dataset is rather unbalanced, i.e., there is a high variance in the number of examples for each condition. Example images for each label are presented in Table 3.
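To make the degree of imbalance concrete, the following minimal sketch computes the share of each class from the counts reported above. Only the four counts come from the paper; the variable names are illustrative.

```python
# Class counts as reported for the original dataset.
counts = {"good": 526, "mild": 13156, "moderate": 3742, "severe": 825}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Imbalance ratio: size of the largest class divided by the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:>8}: {counts[label]:>6} images ({share:.1%})")
print(f"total: {total}, largest/smallest = {imbalance_ratio:.1f}")
```

The mild class alone accounts for roughly 72% of the images, which is why the later sections rely on augmentation, class weighting, and the F1 score rather than on accuracy alone.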

In this study, 18,249 original alopecia images were used. To reduce the imbalance of the data, the number of images was increased by 6-10 times, depending on the distribution of the classes, to 49,118 by applying the data augmentation approaches described in Section 3.2. Then, 39,545 images (80% of the total) were used as training data, and 9573 images (20% of the total) were used as test data. The overall data processing process is shown in Figure 1.

Data Augmentation
Each original image was photographed in a different environment and thus has a different size and background. Therefore, some images contain background that is not related to scalp information, as well as noise, such as the shadow of the camera at the edge of the image. To remove this noise, each image is cropped to a square of 600 pixels around its center. Since each original image is larger than 600 pixels, the shadow and noise appearing at the edge are removed by cropping. Shadows that are not removed by cropping are faded through subsequent preprocessing, such as color conversion. In other words, this paper used 600 × 600 center-cropped images as input data. This also reduces the resources required for training the deep learning models.
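The center crop described above can be sketched as follows. This is a minimal illustration that represents an image as a nested list of pixels; the function name `center_crop` and the toy image are assumptions made for the example, not the authors' code.

```python
def center_crop(image, size=600):
    """Crop a size x size square around the image center.

    `image` is a list of rows; each row is a list of pixels.
    """
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# Tiny stand-in for a photograph: 6 rows x 8 columns, pixel = (row, col).
toy = [[(r, c) for c in range(8)] for r in range(6)]
cropped = center_crop(toy, size=4)
```

With real photographs the default `size=600` would be used; the border rows and columns, where camera shadows tend to appear, are simply discarded.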
Jakubik [5] asserted that the accuracy is the best when the augmentation related to image transformation is applied to the training as well as the test datasets. The scalp dataset is insufficient to generate a high-accuracy model; therefore, data augmentation was applied to the original dataset. Scalp images present very similar characteristics when they are flipped up, down, left, and right; therefore, images were flipped vertically and horizontally, rotated by 90°, and then rotated in additional 10° steps. Affine transformation was applied to distort the scalp images, considering input from various angles. We also distort the color of the images in order to increase the accuracy for images obtained from patients of diverse races and hair colors. The pixel values are then divided by 255 to scale them to the (0, 1) interval as a simple normalization so that they can be used as inputs to the CNN-based networks.
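The interpolation-free part of the geometric augmentation (flips and 90° rotations) and the division by 255 can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, images are nested lists, and the 10° rotations and affine distortions are omitted because they require interpolation.

```python
def hflip(image):
    """Mirror the image left-right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror the image top-bottom."""
    return [row[:] for row in image[::-1]]


def rot90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def scale_to_unit(image):
    """Divide 8-bit pixel values by 255 (simple normalization)."""
    return [[value / 255 for value in row] for row in image]


def geometric_variants(image):
    """Return the 8 flip/rotation variants obtainable without interpolation."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(hflip(current))
        current = rot90(current)
    return variants


toy = [[1, 2, 3], [4, 5, 6]]
variants = geometric_variants(toy)
```

The vertical flip does not need its own loop iteration: it is the same as a 180° rotation followed by a horizontal flip, so it already appears among the eight variants.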
The color of the skin varies depending on the lighting. As such variation is disadvantageous in this analysis, we apply a normalization process. We extract the color from a reference image and make the color of the input images similar to that of the reference image. In the dataset, it was found that skin color existed within the spectrum of red to blue. We determine one reference image that we think is in the middle range. Then, the color extracted from this image is used to normalize the other images. Equations (1)-(3) represent the process described above. First, M is calculated by obtaining the average of each of the RGB channels of image I_xy. L is then computed by obtaining the average of all pixels of image I_xy. The color is extracted from the reference image I_r using the color extraction algorithm, and the color indicators M_r and L_r are calculated. The color indicators M_i and L_i for the input images I_i are computed in the same manner. Then, we modify the input images I_i to create the processed images I_p using Equation (4). This calculation is performed by adding the color index difference between the reference image and the input image. The dataset contained many images having red, yellow, white, green, and blue colors. One from each of these was selected as an example image I_i to analyze the results of image processing for each color. Any image may be used as the reference image I_r for extracting a color. We selected the white image as the reference image among the example images. This is because the red, green, and blue components of white are the most uniform, which is expected to reduce the effect of color bias on the deep neural networks. Table 4 presents the results of generalizing the skin tone of the example images by color classification. Green and blue images were satisfactorily changed, but red and yellow were not. To overcome this issue, we proceeded with the addition of red in the next step.

As a strategy for data augmentation, we generate new images by adding a red tone to the existing images. The red addition is calculated simultaneously with the previously mentioned skin tone normalization. We use N (e.g., 5), which indicates how many images will be generated, and a range ratio for how much red we want to add (e.g., [−0.2, 0.2]). The method of calculating the processed image I_rp from the skin-tone-normalized image I_p is given in Equation (5). The M_r^R value of the reference image is multiplied by γ, which is a random value drawn from ratio. The ratio is a range of arbitrarily determined values. The γ × M_r^R value is added to the red value of the input image, I_p^R. This process is repeated N times. As for the user input parameters, the number of output images N was set to 5, and the red processing range ratio was set to between −20% and 20%. The parameters were carefully chosen so that red was well added and removed in all the randomly sampled images. The red tone was randomly added or subtracted so that the deep neural network learning is less affected by the existing amount of red. Table 5 presents an example of red addition and augmentation. I_p denotes the input and I_rp denotes the result.
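Equations (1)-(5) are not reproduced in this text, so the following is only one plausible reading of the two steps, not the authors' implementation. It assumes that the tone correction shifts each channel by the difference between the reference and input channel means (the overall mean L is not used here), and that γ is drawn uniformly from ratio. Images are flat lists of (r, g, b) pixels, and all function names are hypothetical.

```python
import random


def channel_means(pixels):
    """Per-channel averages (M^R, M^G, M^B) of a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def clip(value):
    """Keep a channel value inside the 8-bit range."""
    return max(0.0, min(255.0, value))


def normalize_tone(pixels, ref_means):
    """Assumed form of Eq. (4): add (reference mean - input mean) per channel."""
    means = channel_means(pixels)
    shift = [ref_means[c] - means[c] for c in range(3)]
    return [tuple(clip(p[c] + shift[c]) for c in range(3)) for p in pixels]


def add_red(pixels, ref_red_mean, n=5, ratio=(-0.2, 0.2), rng=None):
    """Eq. (5) as described: red channel + gamma * M_r^R, repeated n times."""
    rng = rng or random.Random()
    variants = []
    for _ in range(n):
        gamma = rng.uniform(*ratio)  # assumption: uniform sampling
        offset = gamma * ref_red_mean
        variants.append([(clip(r + offset), g, b) for r, g, b in pixels])
    return variants


# Toy data: a "white" reference and a bluish input image.
reference = [(200, 200, 200)] * 4
sample = [(90, 150, 240), (110, 150, 250)]

ref_means = channel_means(reference)
normalized = normalize_tone(sample, ref_means)
red_variants = add_red(normalized, ref_means[0], n=5, rng=random.Random(0))
```

After `normalize_tone`, the channel means of the input match those of the reference; `add_red` then produces N = 5 images whose red channel is shifted by at most 20% of the reference red mean in either direction.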
For more color transformation, the PCA augmentation described in Krizhevsky et al. [17] was used. First, α, which denotes variance, was determined. The process of calculating the PCA-augmented image I_pca from the input image I_xy = [I_xy^R, I_xy^G, I_xy^B]^T is as given in Equation (6), where p_i and λ_i are the i-th eigenvector and eigenvalue of the 3 × 3 covariance matrix of I_xy, respectively. The p_i and λ_i values are calculated from I_xy. We set the parameter α to 0.3. The reason is that if the value is higher than 0.5, some pixels become abnormal, as shown in Figure 2, and with a value of 0.1, the change in the RGB pixel values is very small, i.e., between −1 and 1. Table 6 presents the results of PCA augmentation; the difference is barely visible to the eye, but the neural network can make use of it.

Class weights (Equation (7)): (1 − train_dict[0]/total_num, 1 − train_dict[1]/total_num, . . ., 1 − train_dict[3]/total_num)
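For the PCA augmentation step, Equation (6) is not reproduced in this text. In the standard formulation of Krizhevsky et al., which the description above appears to follow, it would read as below. The symbol a_i for the random coefficients is introduced here for illustration, and treating α as the spread of their Gaussian distribution is an assumption based on the statement that α "denotes variance".

```latex
I_{pca} = I_{xy} + [\,p_1,\ p_2,\ p_3\,]\,
          [\,a_1 \lambda_1,\ a_2 \lambda_2,\ a_3 \lambda_3\,]^{T},
\qquad a_i \sim \mathcal{N}(0, \alpha), \quad i = 1, 2, 3
```

Each a_i is drawn once per image, so every pixel of that image is shifted by the same RGB offset along the principal color directions.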

Model Description
For our experiments, we used four CNN models (ResNet, ResNeXt, DenseNet, and XceptionNet) that are known to be effective for various image analysis tasks, including medical image classification.The description of each model is as follows.

ResNet and ResNeXt
Simonyan et al. [18] confirmed through the VGG network that the deeper the network model, the better the performance. There is a limitation in that the performance does not improve beyond 16 layers in VGG, but ResNet allows the performance to improve even in networks deeper than VGG. When applying ResNet, the deeper network has a lower error rate. Chollet et al. [14] explained that ResNeXt has a simpler structure compared to other models (ResNet-101/152, ResNet-200) but has better performance for image analysis in general. Increasing the cardinality results in fewer errors than making ResNet deeper and wider. The detailed configuration of the ResNet and ResNeXt is presented in Table 7, and it is expressed based on the reference.

DenseNet
We also used DenseNet because it is a CNN model that is frequently compared to ResNet. Xie et al. [19] described its advantages: the number of parameters can be considerably reduced, feature propagation can be enhanced, and the vanishing gradient problem can be prevented. It performs well and has low computational complexity on representative datasets, such as SVHN and ImageNet. The detailed configuration of the model is given in Table 8.

XceptionNet
Reference [20] explained that XceptionNet is an Extreme Inception model that is known to perform better with the same number of parameters. It is a model that adds depth-wise separable convolution to the Inception model. Furthermore, it is a model with ResNet-based separable convolutions trained on the ImageNet dataset and is known to perform best for various image qualities. A detailed configuration of the model is given in Figure 3. We used XceptionNet with this configuration, since it performed well, with an F1 score of 73.19% without augmentation. The final process for the input data and data augmentation for forming the ensemble is represented in Figure 3.
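The parameter savings of the depth-wise separable convolution mentioned above can be checked with a short calculation. This is a generic worked example, not a layer taken from the paper's configuration: the kernel size and channel counts are chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a regular kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depth-wise separable convolution (no bias)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 2))
```

For this layer the regular convolution needs 294,912 weights, whereas the separable version needs 33,920, which is the kind of saving that lets Xception spend the same parameter budget on a more expressive architecture.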

Experimental Environments
For comparison, we used a model without data augmentation and a model with data augmentation.All experiments were conducted using an Adam optimizer, stepLR scheduler, and the cross-entropy loss function described in Section 3.3, with a batch size of 32 and 30 epochs.
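As an illustration of the step learning-rate schedule, the sketch below reproduces the usual step-decay rule in plain Python. Only the batch size (32) and the number of epochs (30) come from the paper; the initial learning rate, step size, and decay factor are hypothetical values, since the text does not report them.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Step decay: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


EPOCHS = 30        # from the paper
BATCH_SIZE = 32    # from the paper
BASE_LR = 1e-3     # hypothetical
STEP_SIZE = 10     # hypothetical
GAMMA = 0.1        # hypothetical

schedule = [step_lr(BASE_LR, epoch, STEP_SIZE, GAMMA) for epoch in range(EPOCHS)]
```

With these assumed values the rate stays constant for ten epochs at a time and drops by a factor of ten at epochs 10 and 20.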

Experimental Result
Table 9 summarizes the best results of each model trained without the data augmentation process. The number in each network name denotes the number of layers in the network. For example, ResNet101 has 101 layers and ResNet152 has 152 layers. The F1 score and accuracy are presented. Accuracy is the ratio of correct predictions among all classification results, and the F1 score is the harmonic mean of precision and recall. In this experiment, given that the number of examples per class is unbalanced, the F1 score was used to account for the class imbalance. The F1 score and accuracy are calculated as follows.
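The equations themselves are not reproduced in this text; the standard definitions are accuracy = correct / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall). The sketch below computes them from a confusion matrix. The toy matrix is invented for illustration, and averaging the per-class F1 scores with equal weight (macro averaging) is an assumption about how the reported average F1 was obtained.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def per_class_f1(cm):
    """F1 per class; cm[i][j] counts samples of true class i predicted as j."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


# Hypothetical two-class confusion matrix, for illustration only.
toy_cm = [[8, 2],
          [1, 9]]
toy_accuracy = accuracy(toy_cm)
toy_f1 = per_class_f1(toy_cm)
```

Because every class contributes equally to a macro average, a rare class with a low F1 pulls the average down even when overall accuracy is high, which matches the behavior discussed for the unbalanced scalp dataset.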
The F1 score varies across the classes (scalp conditions). For example, the F1 score for mild (1) is 93.36%; however, the score for moderate (2) is only 21.25%. We believe that such variances are due to the differences in the number of available data for each class. Across the models, DenseNet201 has the highest average F1 score (77.12%) compared to ResNet, ResNeXt, or XceptionNet.
Table 10 lists the F1 scores with data augmentation. Data augmentation was applied to the training as well as the test datasets, using the augmentation approach described in Section 3. ResNeXt101 has the highest average F1 score (86.8%) compared to the other models. We noticed that the difference in F1 score between labels also decreased with the data augmentation. Figure 5 lists the F1 score and accuracy of the three-model ensembles with data augmentation. Overall, the performance is further improved. The ensemble model with ResNeXt101, DenseNet169, and XceptionNet41 has the highest F1 score (87.74%). The ensemble model with DenseNet169, XceptionNet41, and ResNet101 has the highest accuracy (95.75%). The F1 score was 0.24% higher when three models were ensembled, while the accuracy was 0.09% higher when two were ensembled. Furthermore, the F1 score for each label increased more evenly when we ensembled three models.
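The text does not state how the predictions of the individual networks are combined. One common choice, shown here purely as an assumption, is soft voting: the class probabilities of the models are averaged, and the class with the highest mean probability is selected. The probability values below are made up for illustration.

```python
def soft_vote(prob_sets):
    """Average class probabilities over models, sample by sample.

    prob_sets[m][s][c] is the probability model m assigns to class c
    for sample s.
    """
    n_models = len(prob_sets)
    averaged = []
    for s in range(len(prob_sets[0])):
        n_classes = len(prob_sets[0][s])
        averaged.append(
            [sum(model[s][c] for model in prob_sets) / n_models
             for c in range(n_classes)]
        )
    return averaged


def predict(probabilities):
    """Index of the most probable class for each sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]


# Hypothetical softmax outputs of three models for two samples
# (classes: 0 good, 1 mild, 2 moderate, 3 severe).
model_a = [[0.1, 0.6, 0.2, 0.1], [0.1, 0.3, 0.5, 0.1]]
model_b = [[0.2, 0.5, 0.2, 0.1], [0.1, 0.5, 0.3, 0.1]]
model_c = [[0.1, 0.7, 0.1, 0.1], [0.0, 0.2, 0.7, 0.1]]

ensemble_probs = soft_vote([model_a, model_b, model_c])
ensemble_labels = predict(ensemble_probs)
```

In the second sample, one model alone would have predicted mild, but the averaged probabilities favor moderate; this smoothing of individual errors is the usual motivation for ensembling.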

Conclusions
An increasing number of people suffer from alopecia every year, and it is very difficult to diagnose alopecia early on. It is important to create an AI model that can diagnose alopecia early. However, proper alopecia datasets for training are not easily found because the data collection requires domain expertise, and the number of available datasets is not very large. This paper presents an approach for improving the classification performance using a set of data augmentation techniques appropriate for scalp images and model ensembles, achieving an accuracy of 95.84% and an F1 score of 87.74%.
Specifically, the number of images was increased by applying geometry-based augmentation through operations such as rotate, vertical flip, horizontal flip, crop, and affine transformation. As the color of the scalp may vary between races, we performed normalization using PCA augmentation, a color-based augmentation technique. For the unbalanced dataset, we applied the focal loss function.
When we evaluated individual classes without data augmentation, the highest F1 score was achieved using DenseNet. When we ensembled two models, ResNeXt101 and DenseNet169 had the highest F1 score (87.5%). The ensemble model having ResNeXt101, DenseNet169, and XceptionNet41 achieved the highest F1 score (87.74%). In general, the F1 score for each label increased more evenly when three models were ensembled.
This study used only a microscope image dataset, but for future work, we plan to make use of images captured using regular cameras, which can be used for more general applications.

Figure 1. Data preprocessing and data augmentation of scalp images.

Figure 2. Some pixels are overprocessed and show a different color than the surroundings when α has a value higher than 0.5.
As described in Section 3.1, the dataset is highly unbalanced. For training the neural network, we adjusted the focal loss function for the unbalanced dataset. The number of training data is calculated for each class (good, mild, moderate, and severe), and each weight used in the CNN-based model is calculated using Equation (7).
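The class weighting of Equation (7), in which each weight is one minus the class share of the training data, can be sketched as follows. The names `train_dict` and `total_num` follow the equation; the counts used here are those of the original dataset and serve only as an illustration, since the per-class counts of the actual training split are not listed in the text.

```python
def class_weights(train_dict):
    """Weight per class: 1 - (class count / total count), ordered by label."""
    total_num = sum(train_dict.values())
    return [1 - train_dict[label] / total_num for label in sorted(train_dict)]


# Illustrative counts (original dataset): 0 good, 1 mild, 2 moderate, 3 severe.
train_dict = {0: 526, 1: 13156, 2: 3742, 3: 825}
weights = class_weights(train_dict)
print([round(w, 3) for w in weights])
```

The most frequent class (mild) receives the smallest weight and the rarest class (good) the largest, so errors on rare classes contribute more to the loss.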

Figure 3. Deep learning model architecture for alopecia classification.

Figure 4 presents the F1 score and accuracy with ensembles of two models that make use of data augmentation. The ensemble model with ResNeXt101 and DenseNet169 has the highest F1 score (87.5%). The ensemble model with DenseNet169 and XceptionNet41 has the highest accuracy (95.84%).

Figure 4. F1 score and accuracy of the best ensemble models with data augmentation.

Figure 5. F1 score and accuracy of the triple ensemble model with data augmentation.

Table 1. Summary of the existing scalp data preprocessing works.

Table 3. Example image for each label.

Table 4. Examples of skin color normalization.

Table 5. Examples of red addition and augmentation.

Table 7. Detailed configuration of the ResNet and ResNeXt.

Table 8. Detailed configuration of the DenseNet.

Table 9. F1 score of a single network model without data augmentation.

Table 10. F1 score of a single network model with data augmentation.