Development and Experimental Evaluation of Machine-Learning Techniques for an Intelligent Hairy Scalp Detection System

Featured Application: Deep learning, decision trees, linear discriminant analysis (LDA), support vector machines (SVMs), the k-nearest neighbors algorithm (K-NN), and ensemble learning are evaluated for detecting hairy scalp problems. To the best of our knowledge, this is the first case study to apply modern machine learning to the diagnosis and analysis of hairy scalp issues.

Abstract: Deep learning has become the most popular research subject in the fields of artificial intelligence (AI) and machine learning. In October 2013, MIT Technology Review commented that deep learning was a breakthrough technology. Deep learning has made progress in voice and image recognition, image classification, and natural language processing. Prior to deep learning, decision trees, LDA, SVMs, K-NN, and ensemble learning were popular in solving classification problems. In this paper, we applied these techniques and deep learning to hairy scalp images. Hairy scalp problems are usually diagnosed by non-professionals in hair salons, and people with such problems may be advised by these non-professionals. Additionally, several common scalp problems are similar; therefore, non-experts may provide incorrect diagnoses, and scalp problems may worsen as a result. In this work, we implemented and compared a deep-learning method based on the ImageNet-VGG-f model, Bag of Words (BOW) with machine-learning classifiers, and histogram of oriented gradients (HOG)/pyramid histogram of oriented gradients (PHOG) with machine-learning classifiers. The tools from the classification learner apps were used for hairy scalp image classification. The results indicated that deep learning can achieve an accuracy of 89.77% when the learning rate is 1 × 10^−4, and this accuracy is far higher than those achieved by BOW with SVM (80.50%) and PHOG with SVM (53.0%).


Introduction
In recent years, machine-learning techniques have been widely used in computer vision, image recognition, stock market analysis, medical diagnosis, natural language processing, voice/speech recognition, etc. Machine learning is a branch of artificial intelligence (AI), and the term is often used interchangeably with AI. AI research has shifted its emphasis from reasoning to knowledge and then to learning, which represents a natural and distinct sequence of events. Machine learning has become a methodology for realizing AI, and it also represents a method of resolving issues associated with AI. Historically, huge quantities of data had to be analyzed to accomplish many different tasks, and such analyses were used for policy making and modeling; however, machine learning provides a more effective and productive methodology for acquiring knowledge.
Machine learning can progressively improve forecast modeling capabilities and utilize relevant data to assist with policy formulation. Therefore, this approach has become an increasingly important tool in computer science-related research and has played a progressively vital role in people's daily lives. Machine learning has provided more advanced junk email filters, better software for voice/speech recognition, and reliable internet search engines, among other benefits. Deep learning has attracted increasing attention in academia and industry and represents the most popular research direction in the fields of AI and machine learning. MIT Technology Review commented in October 2013 that deep learning was a breakthrough technology [1]. Deep learning has made progress in voice and image recognition, image classification, and natural language processing. Prior to deep learning, support vector machines (SVMs) represented a popular approach to solving classification problems.
In this paper, we use the diagnosis and analysis of hairy scalps as a case study for machine-learning techniques. In this case study, we implemented, evaluated, and compared a deep-learning method based on the ImageNet-VGG-f model, Bag of Words (BOW) with machine-learning classifiers, and histogram of oriented gradients (HOG)/pyramid histogram of oriented gradients (PHOG) with machine-learning classifiers. We selected scalp state detection as our case study because many people have hairy scalp lesions caused by work pressure, long working hours, and a lack of scalp care, among other reasons.
In the hair salon industry, dermatology clinics, and medical clinics, the state of the hairy scalp is frequently assessed manually. The professional education and training costs in these industries are very high. Moreover, the accuracy of hairy scalp state recognition varies among individuals and does not follow a set of criteria.
In this paper, we test whether machine-learning technology can be applied to hairy scalp detection. A machine learning-based hairy scalp detector can automatically recognize the state of the hairy scalp. Moreover, the learning algorithms can continue to be trained to increase the accuracy of scalp detection. We believe that machine learning-based AI image processing methods should be able to effectively solve the aforementioned hairy scalp detection problem. By integrating machine-learning technology into a scalp state detector, the use of human-based assessments and the resulting errors can be reduced. To the best of our knowledge, this is the first study to apply modern machine-learning techniques to the diagnosis and analysis of hairy scalp problems.
The remainder of this paper is organized as follows. Section 2 introduces the preliminaries of machine-learning techniques. Section 3 contains a review of previous works on the design and development of image classification and recognition using machine-learning techniques. Section 4 presents the machine-learning techniques evaluated for diagnosing and analyzing hairy scalps. Section 5 describes the experimental results and provides a related discussion. Section 6 presents our conclusions and plans for future work.

Preliminaries
Figure 1 describes the principles [2] of building a machine-learning system. As a predictive modeling system, machine-learning technology can be divided into four parts: preprocessing, learning, evaluation, and prediction. Generally, raw image data cannot be used directly for calculations because they may contain a considerable amount of useless information that degrades the performance of the learning algorithms. Hence, the preprocessing of raw data plays an important role in every application of machine-learning systems. In this work, the raw data are scalp images, and we try to capture the key features. The key features are usually the color, luminance, and density of hair. Features are selected to meet particular restrictions so that they perform well when using machine-learning algorithms. A good machine-learning algorithm predicts good results on both training and new datasets. The datasets are usually automatically divided into two parts (training and testing datasets). The training dataset is for training and optimizing the models, and the testing dataset is for evaluating the performance of the model.
Each machine-learning algorithm has its own advantages for solving specific problems, and no single learning algorithm can manage all problems. Hence, to train and discover the best models, different learning algorithms can be applied to the same dataset. Comparisons of the results of each learning algorithm can be used to identify the most suitable learning model.
In the 1980s, the multilayer perceptron (MLP) was a very popular machine-learning technique, especially for speech recognition and image recognition. However, since the 1990s, the MLP has encountered strong competition from the simpler support vector machines (SVMs). Recently, due to the success of deep learning, the MLP has regained attention. Modern deep-learning techniques are based on the MLP structure.
An MLP includes at least one hidden layer (in addition to one input layer and one output layer). Compared to the single-layer perceptron (SLP), which can only learn linear functions, an MLP can learn non-linear functions. Figure 2 shows a simple structure of an MLP with a hidden layer. Please note that all connections are weighted, but only three weights (w0, w1, and w2) are marked in Figure 2. A simple structure of an MLP is introduced as follows.

Input layer: The input layer has three nodes. The offset (bias) node has a value of 1. The other two nodes take the external inputs X1 and X2 (numerical values from the input dataset). No calculation is performed in the input layer; thus, its outputs 1, X1, and X2 are passed to the hidden layer.
Hidden layer: The hidden layer also has three nodes. The offset node output is 1. The output of the other two nodes of the hidden layer depends on the outputs of the input layer (1, X1, and X2) and the weights attached to the connections (edges). Figure 2 shows the calculation of an output in a hidden layer (highlighted). The output calculations of the other hidden nodes are the same. Please note that "f" refers to the activation function. These outputs are passed to the nodes of the output layer.
Output layer: The output layer has two nodes, receives input from the hidden layer, and performs calculations similar to those of the highlighted hidden node. The calculated values (Y1 and Y2) are the outputs of the MLP.
As a result, given a set of features X = (x1, x2, ...) and a goal Y, an MLP can learn the relationship between features and goals for the purpose of classification or regression.
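The layer-by-layer computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the configuration used in this work: the sigmoid is assumed as the activation function f, and the weight values and the helper names `layer_forward` and `mlp_forward` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, assumed here as the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, f=sigmoid):
    """Compute the outputs of one layer.

    `inputs` excludes the offset node; a constant 1 is prepended here.
    `weights[j]` holds (w0, w1, w2, ...) for node j, where w0 multiplies
    the offset node.
    """
    extended = [1.0] + list(inputs)
    return [f(sum(w * x for w, x in zip(row, extended))) for row in weights]

def mlp_forward(x, hidden_weights, output_weights):
    """Forward pass: input layer -> hidden layer -> output layer."""
    hidden = layer_forward(x, hidden_weights)
    return layer_forward(hidden, output_weights)

# Hypothetical weights, chosen only for illustration.
hidden_w = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]]
output_w = [[0.0, 1.0, 1.0], [0.0, -1.0, 1.0]]

# X1 = X2 = 1.0; Y1 and Y2 are the two outputs of the MLP.
y1, y2 = mlp_forward([1.0, 1.0], hidden_w, output_w)
```

Each node forms a weighted sum of the previous layer's outputs (including the constant offset) and applies f, which is the same calculation highlighted in Figure 2.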
To assess the effectiveness of a given model, the accuracy of different models is compared. In this work, we have applied several popular machine-learning algorithms to analyze and diagnose hairy scalp images. First, we classify the hairy scalp images manually and compare the classifications of each model against these manual labels to determine the quality of recognition.
Consequently, test datasets can be used to test the proposed models. Suitable models are first selected based on the training dataset. The performance on the test dataset, including the error rate, must then be determined as an estimate of the performance on unseen data. Subsequently, optimal models can be applied to predict new and future data.

Related Works
For traffic applications [3][4][5][6][7][8][19], Lousier and Abdelkrim [3] proposed a bag of features (BoF)-based machine-learning framework for image classification and assessed the performance of training models using different image classification algorithms on the Caltech 101 images [4]. These authors also adopted the proposed BoF-based machine-learning framework to identify stop sign images so that the trained classifier could be applied in a robotic system. Ahmed et al. [5] presented a text recognition method that adopted convolutional neural networks (CNNs) as the deep-learning classifier for detecting and recognizing Arabic text. The error rate of their proposed recognition methodology was 15% on cursive script scene data.
Jagannathan et al. [6] implemented an embedded system-based object detection and classification mechanism that adopted a commercial SoC solution. The main components of this commercial SoC solution are fixed/floating-point dual DSP cores, a fully programmable VisionAccelerationPac (EVE), dual ARM-based Cortex M4 cores, and an image signal processor. In this work, the authors combined two methodologies: an AdaBoost cascade classifier with 10 HOG features for object detection and a 7-layer CNN classifier for object classification. The implemented methodologies achieved accuracies of 74.6% for pedestrian detection, 79.4% for vehicle detection, 79.6% for traffic sign detection, and 89.6% for traffic sign classification.
Du et al. [7] proposed an end-to-end deep-learning model that adopted the Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) architecture to predict real-time vehicular ego-motion for an autonomous driving system. The error rate of their proposed CNN-LSTM-based model for multiple ego-motion classification was 0.0417. Pop et al. [8] proposed different cross-modality CNN-based deep-learning training approaches for pedestrian recognition, which consisted of a correlated model, an incremental model, and a particular cross-modality model.
For hyperspectral image classification applications [9][10][11][12][13][14], Ermushev and Balashov [9] developed a complex machine-learning technique for target detection from ground radar images called CTDCM (complex ground radar target detection and classification method), which was based on a three-step analysis that included a learning procedure, a data pre-processing examination, and a target-labeling step coupled with classification. The classification error rate of their proposed CTDCM was 12%.
Zhao and Du [10] proposed a spectral-spatial feature-based classification framework that integrated deep-learning techniques with dimension reduction for hyperspectral image classification. This work overcomes the insufficient discovery of objects that present considerable variations of shape in fixed detection windows. Zhong et al. [11] designed a supervised deep-learning framework that alleviated the declining accuracy of deep-learning models. The proposed work classified many agricultural and urban hyperspectral imagery datasets (Indian Pines, Kennedy Space Center, and University of Pavia).
Qader et al. [12] classified the types of vegetation in Iraq using satellite-based phenological characteristics and achieved an overall accuracy of 85%. Chen et al. [13] applied exclusive and hierarchical relationships to enhance the classification accuracy of multiple-label scenes. In this work, the authors combined these two relationships with an LSTM to form an accurate CNN-based scene classifier.
For the other applications [15][16][17][18][19][20][21], Zhang et al. [15] proposed a covariance descriptor that combined visual and geometric information. Moreover, this work integrated a classification framework with dictionary learning for the object recognition of 3D point clouds. Remez et al. [16] presented a fully CNN-based deep-learning architecture for image de-noising that uses a splitting scheme to achieve sub-optimization. Nasr et al. [21] used multi-class SVMs to classify facial images for robotic applications. This work adopted BoF as the face representation. In their work, scale-invariant feature transform (SIFT) features were replaced by speeded up robust features (SURF) for rapid and accurate extraction. Moreover, SURF was also used for selecting interest points.
In addition, a few hairy scalp issues have also been researched [22][23][24]. Shih and Lin [22] developed hair segmentation and counting algorithms based on an unsupervised mechanism for diagnosing the health condition of a person's hair. Nakajima and Sasaki [23] proposed an automatic health monitoring system and considered two subjects who had thinning hair due to aging. In their work, the hair whorl was recognized by a closed circle. Lee et al. [24] developed and manufactured a carbon nanotube/adhesive polydimethylsiloxane-based electroencephalograph (EEG) electrode with which EEG signals can be read and recorded from the hairy scalp.

Machine-Learning Techniques for Diagnosing and Analyzing Hairy Scalps
Figure 3 shows the system architecture of the intelligent scalp detection system (ISDS) [25]. The ISDS consists of a scalp detector, an app running on a tablet, machine-learning techniques [2,26], and a cloud management platform. The scalp detector is connected to the tablet through a Wi-Fi network. A scalp photo is captured by the scalp detector, and the recognition result is sent to and displayed on the tablet. Thus, we can obtain quantitative data on scalps. Furthermore, based on the ISDS, we compare three state-of-the-art machine-learning methods using scalp image data: deep learning, BOW [27] with machine-learning classifiers, and PHOG [28] with machine-learning classifiers.
A pre-trained model and transfer learning are included in the deep-learning method, as shown in Figure 4. The ImageNet-VGG-f model [29] is a pre-trained model that has been trained using 1,200,000 images. The parameters of this pre-trained model are used as the initial parameters for fine-tuning on our scalp image dataset, which reduces the training time compared with training from scratch and increases the accuracy compared with choosing random initializations.
Figure 5 shows the second method, which is the combination of BOW and SVM. For the BOW, both the training and testing image features are obtained via the SIFT [30] method. Then, we apply the K-means [31] method to the SIFT features of the training images to create a codebook. Histograms [32] are built for each image according to the codebook. Finally, the histograms of the training images are used to train an SVM classifier, and the SVM classifier is used to predict the class of each test image.
The third method, which is shown in Figure 6, uses a HOG algorithm to obtain the training and testing image features. This method trains an SVM classifier based on the HOG features of the training images.
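The histogram step of the BOW method can be sketched as follows. This is a minimal illustration under stated assumptions: the codebook is taken as already produced by K-means, the two-dimensional "descriptors" are toy values (real SIFT descriptors have 128 dimensions), and the function names are hypothetical rather than taken from the tools used in this work.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """Normalized histogram of codeword occurrences for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy codebook with three codewords (assumed output of K-means).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Toy descriptors extracted from one image.
image_descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]

hist = bow_histogram(image_descriptors, codebook)
```

Every image, regardless of how many descriptors it yields, is mapped to a fixed-length vector with one entry per codeword, which is what allows an SVM to be trained on the histograms.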

Deep Learning
Deep learning has become a popular method for image processing and enables automatic feature extraction as a type of feature learning. It replaces feature engineering, which requires analysis by knowledgeable and experienced specialists.
Training in a deep-learning framework generally involves three steps: defining a network structure, defining a learning target, and applying a numerical method. The first step is to define a network structure, which determines the set of candidate functions. With a proper network structure, an efficient deep-learning model can be built through the training process. The second step is to define a learning target by choosing an objective function, such as the mean square error (MSE) or cross entropy.

Finally, during the training process, we use numerical methods to discover the best combination of parameters, including weights and biases, so that the objective function is reduced as much as possible. Backpropagation (BP) is usually used for minimizing an objective function. A deep-learning model is composed of a set of functions that can be used to describe data. If the proper parameters of a function can be obtained, we can predict new input data through those functions. In the following section, we describe how the parameters were updated using BP.

Backpropagation (BP)
BP is utilized to identify suitable parameters for the deep network. We use the partial derivatives (∂J/∂w and ∂J/∂b) of the objective function J with respect to every weight w and bias b in the deep neural network. Equation (1) gives the objective function, in which t represents the real class of the images and o represents the output of the deep-learning model.
The purpose of BP is to minimize J by finding suitable weights w. In this way, BP reduces the difference between the real value and the output of the model. To minimize J and obtain all proper weights, we calculate the partial derivatives of J with respect to w in each layer. We use two layers (as shown in Figure 7) to describe the BP process. First, we calculate the partial derivative of J from Equation (1) with respect to the weight w^(2) of the output layer (as given in Equation (2)). This partial derivative is shown in Equation (3). Target t consists of constants, and the results obtained after performing the partial derivatives are shown in Equation (4).
Second, the partial derivatives of J with respect to the weight w^(1) (Equation (10)) of the layer before the output layer are given by Equations (11) and (12). The chain rule is used in Equation (13).
The partial derivatives of J with respect to the biases b^(1) and b^(2) are shown in Equation (19).
Finally, we use Equations (20) and (21) to update the parameters, including the weights and biases. In these equations, ij represents the location of the neuron in layer l and α is the learning rate.
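The two-layer derivation described above can be written compactly under some explicit assumptions. The sketch below assumes a squared-error objective and a common activation function f in both layers; the symbols z (pre-activation), a (hidden output), and δ (error term) are introduced here for illustration only and do not correspond to the numbered equations of this paper.

```latex
\begin{align}
z^{(1)}_j &= \sum_i w^{(1)}_{ij}\, x_i + b^{(1)}_j, \qquad
a^{(1)}_j = f\!\left(z^{(1)}_j\right), \\
z^{(2)}_k &= \sum_j w^{(2)}_{jk}\, a^{(1)}_j + b^{(2)}_k, \qquad
o_k = f\!\left(z^{(2)}_k\right), \qquad
J = \tfrac{1}{2}\sum_k \left(t_k - o_k\right)^2, \\
\delta^{(2)}_k &= \frac{\partial J}{\partial z^{(2)}_k}
  = -\left(t_k - o_k\right) f'\!\left(z^{(2)}_k\right), \qquad
\frac{\partial J}{\partial w^{(2)}_{jk}} = \delta^{(2)}_k\, a^{(1)}_j, \qquad
\frac{\partial J}{\partial b^{(2)}_k} = \delta^{(2)}_k, \\
\delta^{(1)}_j &= f'\!\left(z^{(1)}_j\right) \sum_k w^{(2)}_{jk}\, \delta^{(2)}_k, \qquad
\frac{\partial J}{\partial w^{(1)}_{ij}} = \delta^{(1)}_j\, x_i, \qquad
\frac{\partial J}{\partial b^{(1)}_j} = \delta^{(1)}_j, \\
w^{(l)}_{ij} &\leftarrow w^{(l)}_{ij} - \alpha\, \frac{\partial J}{\partial w^{(l)}_{ij}}, \qquad
b^{(l)}_{j} \leftarrow b^{(l)}_{j} - \alpha\, \frac{\partial J}{\partial b^{(l)}_{j}}.
\end{align}
```

Because t is constant, only o depends on the parameters, and the chain rule carries the output error back to the hidden layer through the weights w^(2). The last line is the gradient-descent update with learning rate α.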

Convolution
Convolution layers are usually composed of a wide range of filters, as shown in Figure 8. These filters can enhance the features of images. For example, the top of Figure 8 shows the pixel representation of a line filter. After convolving the input images with this line filter, a vertical-line feature map is obtained, from which we can extract the line features contained in those images. After several layers, the CNN can learn to capture more complex features, such as objects. The purpose of training a CNN is to discover the optimal filters. With these filters, testing images can be well represented so that the classification accuracy is enhanced. Shared weights and sparse connectivity are two advantages of CNNs that reduce the number of training parameters.
For shared weights, all neurons in the same hidden layer use the same filter to detect the same feature within an image, such as a line or an edge. Therefore, different parts of an image share the same filter, including its weights and bias. Sparse connectivity means that a neuron in a layer is connected to only some (not all) neurons of the previous layer.
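The effect of a line filter, together with shared weights and sparse connectivity, can be illustrated with a small pure-Python sketch. The 3 × 3 filter and the toy image below are hypothetical values chosen for illustration, not the filter shown in Figure 8.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most CNN libraries, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 3x3 filter that responds to a vertical line.
vertical_line_filter = [[0, 1, 0],
                        [0, 1, 0],
                        [0, 1, 0]]

# 4x5 toy image with a vertical line in column 2.
image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0]]

feature_map = convolve2d_valid(image, vertical_line_filter)
```

The same nine weights are reused at every position (shared weights), and each output value depends only on a 3 × 3 patch of the input (sparse connectivity). The feature map is largest where the patch is centered on the line.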

Rectified Linear Unit (ReLU)
ReLU is a piecewise-linear activation function, f(x) = max(0, x), as shown in Equation (22). If the input of ReLU, x, is less than 0, then the value of ReLU is zero. If the input of ReLU, x, is larger than 0, then the value of ReLU is still x. By comparison, the sigmoid and tanh activation functions are computationally expensive because both involve exponential functions. Most importantly, the sigmoid and tanh functions suffer from the vanishing-gradient problem when implementing BP, which causes information to be lost.
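The difference in gradient behavior can be illustrated numerically. The sketch below is a simplified illustration that ignores the weights and multiplies only the activation derivatives across a hypothetical stack of 10 layers, each evaluated at an input of 2.0.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Product of activation derivatives across `depth` layers (weights ignored).
depth = 10
relu_chain = relu_grad(2.0) ** depth
sigmoid_chain = sigmoid_grad(2.0) ** depth
```

For positive inputs the ReLU derivative is exactly 1, so the product does not shrink with depth, whereas the sigmoid derivative is at most 0.25 and the product decays geometrically.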

Max-Pooling
After performing convolution, the number of parameters is too high to train a classifier, such as a Softmax. Thousands of parameters create an overfitting problem and require an enormous amount of computation. Hence, max-pooling is used to reduce the number of parameters and prevent overfitting. As shown in Figure 9, the highest value is taken from each small area in the previous layer.
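Max-pooling can be sketched as follows. The 2 × 2 window, the stride of 2, and the input values are assumptions chosen for illustration; they are not taken from Figure 9 or from the ImageNet-VGG-f configuration.

```python
def max_pool2d(activations, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    pooled_rows = []
    for r in range(0, len(activations) - size + 1, stride):
        row = []
        for c in range(0, len(activations[0]) - size + 1, stride):
            window = [activations[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled_rows.append(row)
    return pooled_rows

# Toy 4x4 feature map.
activations = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]

pooled = max_pool2d(activations)
```

With these settings, the 16 input values are reduced to 4, so the layers that follow receive a quarter of the inputs.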

Fully Connected Layers (FC)
For the ImageNet-VGG-f model, layers 6 to 8 are fully connected layers (FC). Each neuron of an FC layer connects to every neuron of the previous layer. Hence, the number of parameters of the FC layers is large. Layers six and seven include activation functions. High-level features are computed by layer five, which consists of the convolutional layer, the ReLU, and the pooling layer.
The main concept of FC is to transform the high-level features into one dimension for classification purposes. Therefore, by using matrix multiplication, FC condenses all dimensions into one vector that Softmax uses to train a classifier.

Softmax
The Softmax layer is based on logistic regression extended to multiclass problems; this layer is therefore also called multinomial logistic regression. In this research, four classes are used: bacteria type 1, bacteria type 2, allergy and dandruff. The output of Softmax consists of four probabilities, one per class, and the sum of the four probabilities is equal to 1.
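As a small illustration, the Softmax output for the four classes can be computed as follows. The class identifiers and the example scores are hypothetical; subtracting the maximum score is a common numerical-stability measure and is not taken from the paper.

```python
import math

# Hypothetical identifiers for the four classes named in the text.
CLASSES = ["bacteria_1", "bacteria_2", "allergy", "dandruff"]

def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the most probable class name and all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs

if __name__ == "__main__":
    label, probs = predict([2.0, 1.0, 0.1, -1.0])
    print(label, [round(p, 3) for p in probs])
```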

Data Augmentation
The four categories contain different numbers of images. Hence, before fine-tuning the pre-trained ImageNet-VGG-f model, data augmentation is applied to equalize the number of images in the four classes. Two data augmentation methods are used: Flip and Crop.
We use the MATLAB function fliplr to reverse the order of the image columns in the horizontal direction. For example, the first column of the image is exchanged with the last column of the image. Figure 10a shows the original image, and Figure 10b shows the result of flipping the original image. The Crop method crops part of the original image. Data augmentation also helps to reduce the effect of overfitting.
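The two augmentation operations can be sketched in Python on an image stored as a list of rows. This is an illustrative equivalent of the column reversal performed by fliplr, not the MATLAB code used in the study, and the crop coordinates are arbitrary.

```python
def flip_left_right(image):
    """Reverse the column order of every row (horizontal flip)."""
    return [row[::-1] for row in image]

def crop(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    print(flip_left_right(img))
    print(crop(img, 0, 1, 2, 2))
```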

Bag of Words (BOW)
BOW was originally developed for text documents. The general purpose of this method is to obtain visual words from all documents (here, images) and then count the occurrences of those visual words in each document. Subsequently, the occurrences are represented by a feature vector for each document. The flow chart in Figure 11 shows the four key parts of the BOW method: (1) feature detection; (2) feature description; (3) clustering; and (4) histogram.

Feature Detection
Feature detection finds the interest points of an image. Features can be divided into two categories: global features and local features. Global features use the information contained in the entire image, and the most popular global feature is GIST [33].
For local features, the image is divided into many sub-images or segmented according to the objects within it. The feature detection of BOW uses local features to detect the critical points of images. Many methods are used to detect key points, such as Harris-Affine, Hessian-Affine [34], maximally stable extremal regions (MSER) [35], Lowe's difference-of-Gaussian (DOG), edge-based regions (EBRs), intensity-based regions (IBRs) [36], and salient regions [37]. Tuytelaars and Mikolajczyk [38] surveyed many local feature detectors.

Feature Description
After detecting the key points of an image, we must describe those key points accurately. In this work, we use VLFeat [39], which combines Lowe's DOG detector with the SIFT descriptor. SIFT is invariant to image scale and rotation, so the same SIFT features can be obtained regardless of the rotation and scale of the image. Through SIFT, we finally obtain a 128-dimensional feature vector for each key point.
The SIFT algorithm identifies interest points in different scale spaces. By convolving a Gaussian kernel with the input image, we obtain the different scale spaces. Equation (23) shows the two-dimensional variable-scale Gaussian function, and Equation (24) shows the scale space.

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2}+y^{2})/(2\sigma^{2})} \tag{23}$$

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{24}$$

where $L(x, y, \sigma)$ is the result of convolving the input image $I(x, y)$ with the two-dimensional Gaussian function, $*$ denotes convolution and $\sigma$ is the standard deviation. If the value of $\sigma$ increases, then the image becomes more blurred, and vice versa. Multiple Gaussian scale spaces can be obtained by convolving the original image with Gaussian functions of varying $\sigma$.
Figure 12 shows that the Gaussian pyramid is constructed from different octaves. Figure 13 shows the four octaves of a dandruff image and the five standard deviations in each octave. The images in the same octave are the same size but have different standard deviations.
For example, at the bottom of octave 1, the $\sigma$ value is the lowest, and the layer above uses $k$ times that $\sigma$. As $\sigma$ increases, a variety of scale spaces can be obtained. Each subsequent octave is obtained by down-sampling the previous octave. Based on the Gaussian pyramid, the DOG is generated by subtracting two adjacent images in the same octave. Comparing a candidate pixel with its 26 neighbor pixels (as shown in the right side of Figure 10) identifies feature points as the local maxima and minima in the DOG space. The 26 neighbor pixels comprise the 18 pixels in the layers above and below in the DOG and the 8 pixels in the current layer of the DOG. To identify stable key points within the scale space, the DOG scale space is obtained by convolving different DOG kernels with the original image, as shown in Equation (25).
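The two operations described above, subtracting adjacent blurred images within an octave and comparing a candidate pixel with its 26 neighbors, can be sketched in plain Python as follows. The function names and the nested-list image representation are illustrative assumptions and do not reproduce VLFeat's implementation; the Gaussian blurring itself is assumed to have been done already.

```python
def difference_of_gaussians(blurred_layers):
    """Subtract each blurred image from the next one in the same octave."""
    dog = []
    for lower, upper in zip(blurred_layers, blurred_layers[1:]):
        dog.append([[u - l for l, u in zip(row_l, row_u)]
                    for row_l, row_u in zip(lower, upper)])
    return dog

def is_extremum(dog, s, r, c):
    """True if dog[s][r][c] is strictly above or strictly below all 26 neighbors.

    The neighbors are the 8 pixels around (r, c) in layer s plus the
    9 + 9 pixels in layers s - 1 and s + 1, so (s, r, c) must be interior.
    """
    value = dog[s][r][c]
    neighbors = [dog[s + ds][r + dr][c + dc]
                 for ds in (-1, 0, 1)
                 for dr in (-1, 0, 1)
                 for dc in (-1, 0, 1)
                 if (ds, dr, dc) != (0, 0, 0)]
    return value > max(neighbors) or value < min(neighbors)

if __name__ == "__main__":
    flat = [[0.0] * 3 for _ in range(3)]
    bump = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
    dog = difference_of_gaussians([flat, flat, bump, bump])
    print(is_extremum(dog, 1, 1, 1))
```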

(b) Key Point Localization
Because not all local maxima and minima in the DOG space are feature points, many redundant points must be removed. Brown and Lowe [40] proposed using the Taylor expansion of the scale-space function, shown in Equation (26), to locate feature points more accurately and to remove candidate points that have low contrast.

$$D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\mathbf{x} \tag{26}$$

where $\mathbf{x} = (x, y, \sigma)^{T}$ is the offset from the sampled local maximum or minimum. Then, the derivative of Equation (26) with respect to $\mathbf{x}$ is set to zero. The accurate location of the local maximum or minimum is obtained as $\hat{\mathbf{x}}$, as shown in Equation (27).

$$\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}} \tag{27}$$

Then, we substitute Equation (27) into Equation (26); the result is shown in Equation (28).

$$D(\hat{\mathbf{x}}) = D + \frac{1}{2}\frac{\partial D^{T}}{\partial \mathbf{x}}\hat{\mathbf{x}} \tag{28}$$

Finally, the value of $|D(\hat{\mathbf{x}})|$ at the minimum or maximum must be larger than the threshold, as shown in Equation (29).

$$|D(\hat{\mathbf{x}})| > \text{threshold} \tag{29}$$

If the value of $|D(\hat{\mathbf{x}})|$ is less than the threshold, then the point is unstable and can be removed. Removing the interest points with low contrast is insufficient because the difference-of-Gaussian function has a strong response along image edges. A poorly localized point on an edge has a large principal curvature across the edge and a small principal curvature in the perpendicular direction. Therefore, Lowe used the 2 × 2 Hessian matrix to obtain the principal curvatures. The Hessian matrix is shown in Equation (30).

$$H = \begin{bmatrix} D_{xx}(x, y) & D_{xy}(x, y) \\ D_{xy}(x, y) & D_{yy}(x, y) \end{bmatrix} \tag{30}$$

where $D_{xx}(x, y)$ is the second-order partial derivative of $D(x, y, \sigma)$ in the x direction at a sample point in the DOG. Rather than calculating the individual eigenvalues, Lowe considered only the ratio in Equation (31). The numerator of this ratio is the square of the sum of the eigenvalues, obtained from the diagonal (trace) of H as shown in Equation (32). The denominator of this ratio is the product of the eigenvalues, obtained from the determinant as shown in Equation (33).

$$\frac{\operatorname{Tr}(H)^{2}}{\operatorname{Det}(H)} \tag{31}$$

$$\operatorname{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta \tag{32}$$

$$\operatorname{Det}(H) = D_{xx}D_{yy} - (D_{xy})^{2} = \alpha\beta \tag{33}$$

where $\alpha$ is the largest of the eigenvalues and $\beta$ is the smallest of the eigenvalues. After setting $\alpha = \gamma\beta$, the ratio is as shown in Equation (34).

$$\frac{\operatorname{Tr}(H)^{2}}{\operatorname{Det}(H)} = \frac{(\alpha + \beta)^{2}}{\alpha\beta} = \frac{(\gamma\beta + \beta)^{2}}{\gamma\beta^{2}} = \frac{(\gamma + 1)^{2}}{\gamma} \tag{34}$$

The value of $(\gamma + 1)^{2}/\gamma$ is smallest when $\alpha$ and $\beta$ are equal, and it increases as $\gamma$ increases. Hence, to check that the ratio of principal curvatures is below the threshold $\gamma$, we only need to check Equation (35).

$$\frac{\operatorname{Tr}(H)^{2}}{\operatorname{Det}(H)} < \frac{(\gamma + 1)^{2}}{\gamma} \tag{35}$$

Figure 14a,b show the Bacteria_1 image before and after the threshold method. The numbers of points before and after thresholding, counted in Figure 14a,b, are 1198 and 610, respectively.
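The edge-response check on the trace and determinant of the Hessian can be sketched as a short Python function. The default value gamma = 10 is the value suggested in Lowe's SIFT paper and is an assumption here, since the threshold used in this study is not stated; rejecting points with a non-positive determinant is likewise a common convention rather than something described above.

```python
def passes_edge_test(dxx, dyy, dxy, gamma=10.0):
    """Keep a key point only if Tr(H)^2 / Det(H) < (gamma + 1)^2 / gamma."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign (or a degenerate Hessian): reject.
        return False
    return trace * trace / det < (gamma + 1) ** 2 / gamma

if __name__ == "__main__":
    print(passes_edge_test(1.0, 1.0, 0.0))    # similar curvatures
    print(passes_edge_test(100.0, 1.0, 0.0))  # edge-like response
```

Equal curvatures give a ratio of 4, well below the bound of 12.1 for gamma = 10, whereas a strongly elongated response exceeds it.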
After finding the locations of the key points, we assign an orientation to each key point to obtain rotation invariance. By evaluating Equations (36) and (37) on the pixels around each key point, we obtain the gradient magnitude $M_{sift}$ and orientation $\theta_{sift}$. An orientation histogram with 10 degrees per bin is built from these values. The highest bins correspond to the main directions of the local gradients. In addition to the highest bin, any other bin that exceeds 80% of the highest bin also generates a key point with that orientation. Figure 15a shows the gradient magnitude and orientation of a key point. Similarly, Figure 15b shows the gradient magnitude and orientation for all key points of an image.
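A simplified sketch of the orientation assignment is shown below. It assumes the pixel-difference gradient used in Lowe's SIFT paper (differences of the horizontal and vertical neighbors), 36 bins of 10 degrees, and the 80% rule described above; the Gaussian weighting of the neighborhood used in the full algorithm is omitted, and the function names are illustrative.

```python
import math

def gradient(image, r, c):
    """Gradient magnitude and orientation (degrees in [0, 360)) at pixel (r, c)."""
    dx = image[r][c + 1] - image[r][c - 1]
    dy = image[r + 1][c] - image[r - 1][c]
    magnitude = math.hypot(dx, dy)
    orientation = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, orientation

def dominant_orientations(image, r, c, radius=1, bin_width=10, peak_ratio=0.8):
    """Return the bin start angles whose weight is at least 80% of the peak."""
    n_bins = 360 // bin_width
    hist = [0.0] * n_bins
    for rr in range(r - radius, r + radius + 1):
        for cc in range(c - radius, c + radius + 1):
            m, theta = gradient(image, rr, cc)
            hist[int(theta // bin_width) % n_bins] += m
    peak = max(hist)
    if peak == 0:
        return []  # flat neighborhood: no orientation can be assigned
    return [b * bin_width for b, v in enumerate(hist) if v >= peak_ratio * peak]

if __name__ == "__main__":
    ramp = [[float(col) for col in range(5)] for _ in range(5)]
    print(gradient(ramp, 2, 2))
    print(dominant_orientations(ramp, 2, 2))
```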

(d) Key Point Description
The key point descriptor describes each key point by a 128-dimensional vector. After rotating the neighborhood to the main direction of the key point, Lowe calculated the gradient magnitude and orientation of the 16 × 16 pixels adjacent to the key point. As shown in Figure 16a, the key point is at the center of the square. The length of each arrow represents the gradient magnitude, and the direction of each arrow is the orientation. Then, Lowe divided the 16 × 16 grid into 4 × 4 subregions, as shown in Figure 16b. Each subregion is summarized by eight directions: 45°, 90°, 135°, 180°, 225°, 270°, 315°, and 360°. Finally, we obtain a 4 × 4 × 8 = 128-dimensional vector to describe each key point.

Cluster (K-Means)
K-means clustering is an unsupervised learning method that gathers similar objects into the same group. Minimizing Equation (38) assigns each datum $x_{1}, \ldots, x_{j}$ to the centroid $\mu_{i}$ that is closest to it.

$$SSE = \sum_{i=1}^{k}\sum_{x_{j} \in C_{i}} \lVert x_{j} - \mu_{i} \rVert^{2} \tag{38}$$

where SSE represents the sum of squared errors and $C_{i}$ is the set of data assigned to centroid $\mu_{i}$.
Five steps are performed to achieve K-means clustering:
Step 1: Randomly pick k cluster centroids, $\mu_{1}, \ldots, \mu_{k}$.
Step 2: Calculate the distance between each remaining datum and the k centroids. Then, assign each datum to the nearest cluster centroid.
Step 3: Recalculate each cluster centroid as the mean of the data in that cluster.
Step 4: Regroup all data according to the new cluster centroids.
Step 5: Repeat Steps 3 and 4 until a stopping condition is met, such as no change in the centroids, only a small change in SSE, or no data moving between clusters.
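The five steps can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: the stopping rule is "no datum changes cluster", the random seed and iteration cap are arbitrary assumptions, and an empty cluster simply keeps its previous centroid.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, max_iter=100, seed=0):
    """Return (centroids, assignment, sse) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every datum to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in data
        ]
        # Step 5: stop when no datum moves to another cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(data, assignment) if a == i]
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(data, assignment))
    return centroids, assignment, sse

if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(points, k=2))
```

In the BOW pipeline, the data would be the 128-dimensional SIFT descriptors and the resulting centroids would form the codebook.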

Histogram
For each training image, a histogram counts the occurrences of each visual word in the codebook obtained via K-means. Then, we normalize the histograms and use them to train the SVM. For the testing image data set, we calculate the normalized histograms based on the same codebook.
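A minimal sketch of the histogram step is given below: each descriptor of an image is mapped to its nearest codebook entry, and the counts are normalized. Normalizing so that the bins sum to 1 is an assumption, because the normalization used in the study is not specified, and the two-dimensional descriptors in the example stand in for 128-dimensional SIFT vectors.

```python
def bow_histogram(descriptors, codebook):
    """Normalized count of how often each visual word is the nearest one."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(d, codebook[i])),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)  # image with no descriptors
    return [c / total for c in counts]

if __name__ == "__main__":
    codebook = [(0.0, 0.0), (10.0, 10.0)]
    descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (1.0, 1.0)]
    print(bow_histogram(descriptors, codebook))
```

The same function would be applied to training and testing images with the same codebook, so that all histograms share one set of bins.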

Histogram of Oriented Gradient (HOG)
In addition to SIFT feature extraction, we use the global feature extraction method HOG and its extension PHOG. In this research, the VLFeat [39] function vl_hog is used. The essential idea of HOG is to divide an image into many blocks, each of which contains cells. For each cell, Dalal built a histogram based on the gradient magnitude and orientation of each pixel.
The combined histograms of the 2 × 2 cells in a block form the descriptor. Then, to improve the performance of HOG, Dalal normalized all cells inside each block according to the intensity of the block, which is called contrast normalization. This contrast normalization reduces the effects of shadowing and illumination changes. The general scheme of HOG is described as follows and shown in Figure 17.
Stage A: The centered gradients in the horizontal and vertical directions are calculated. Through Equations (39) and (40), we obtain the centered horizontal and vertical gradients, respectively, where $[-1\ 0\ 1]$ is the 1D centered-point discrete derivative mask in the horizontal direction and $[-1\ 0\ 1]^{T}$ is the same mask in the vertical direction.
Stage B: The distribution of the intensity gradients and orientations of an image can represent the local object appearance and shape well. Equation (41) is used to obtain the gradient magnitude, and Equation (42) is used to obtain the orientation.
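Stages A and B can be sketched for a single cell as follows. The sketch applies the centered mask [−1 0 1] horizontally and vertically, then accumulates gradient magnitudes into orientation bins. The choice of nine unsigned bins over 0° to 180° follows Dalal and Triggs's common setting and is an assumption, as is setting border gradients to zero; block normalization is omitted, and this is not the vl_hog implementation.

```python
import math

def centered_gradients(image):
    """Apply [-1 0 1] along rows (gx) and columns (gy); borders are left at 0."""
    rows, cols = len(image), len(image[0])
    gx = [[0.0] * cols for _ in range(rows)]
    gy = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if 0 < c < cols - 1:
                gx[r][c] = image[r][c + 1] - image[r][c - 1]
            if 0 < r < rows - 1:
                gy[r][c] = image[r + 1][c] - image[r - 1][c]
    return gx, gy

def cell_histogram(gx, gy, n_bins=9):
    """Magnitude-weighted histogram of unsigned orientations for one cell."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

if __name__ == "__main__":
    cell = [[0.0, 1.0, 2.0] for _ in range(3)]
    gx, gy = centered_gradients(cell)
    print(cell_histogram(gx, gy))
```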

Pyramid Histogram of Oriented Gradient (PHOG)
PHOG [41] is an extension of HOG. First, we compute the image at different scales. Then, at a given scale, we divide the image into several patches, such as 2 × 2, 4 × 4, etc. Subsequently, we calculate the HOG features of those patches and place all the HOG feature results into a one-dimensional array. Finally, we perform the same procedure for all other scales of the image and append the results to the same array. Compared with the HOG method, PHOG can detect features at a variety of scales of an image and can represent the image more effectively.

Machine-Learning Classifiers
In this work, we apply the Classification Learner app [42], a tool included in MATLAB. The Classification Learner app provides a variety of machine-learning classifiers, such as SVM, decision tree, linear discriminant analysis (LDA), K-NN, and ensemble learning.

Support Vector Machine (SVM)
SVM is a classification algorithm proposed by Vapnik [27], and it is based on statistical learning theory. Classification plays a significant role in data mining. The SVM method trains a classifier via supervised learning, and the data can then be classified using this model. The purpose of SVM is to find the optimum separating hyperplane, that is, the hyperplane whose margin to the closest data is the largest, as shown in Figure 18a. Although many separating hyperplanes may exist, the optimum separating hyperplane in Equation (43) reduces the effect of noise and decreases the possibility of overfitting. The largest margin separates the pluses from the minuses as much as possible.
where $\mathbf{w}$ is the vector perpendicular to the optimum separating hyperplane and $\cdot$ is the inner product. The two support hyperplanes are given by Equations (44) and (45). Hence, if the data belong to the $y_{i} = 1$ class, then the $\mathbf{x}_{+}$ data can be described by Equation (46). Similarly, if the data are located in the $y_{i} = -1$ class, then the $\mathbf{x}_{-}$ data can be described by Equation (47).
Then, multiplying Equation (46) by $y_{i} = 1$ and Equation (47) by $y_{i} = -1$ yields the single constraint shown in Equation (48).
The dot product of the difference vector $\mathbf{x}_{+} - \mathbf{x}_{-}$ with the unit normal $\mathbf{w}/\lVert\mathbf{w}\rVert$ is the distance between the two support hyperplanes, as shown in Equation (49), where $\mathbf{x}_{-}$ in Figure 18b is a vector on the hyperplane of Equation (45) and $\mathbf{x}_{+}$ is a vector on the hyperplane of Equation (44). According to Equations (44) and (45), the margin equals $2/\lVert\mathbf{w}\rVert$. For the sake of mathematical convenience, maximizing the margin $2/\lVert\mathbf{w}\rVert$ is equivalent to maximizing $1/\lVert\mathbf{w}\rVert$, which is in turn equivalent to minimizing $\lVert\mathbf{w}\rVert$ and hence $\frac{1}{2}\lVert\mathbf{w}\rVert^{2}$, as shown in Equation (50).
Hence, the optimum separating hyperplane is found by solving Equation (51), that is, by minimizing $\frac{1}{2}\lVert\mathbf{w}\rVert^{2}$ subject to the constraint in Equation (48).
The extremum of a function subject to constraints can be found with Lagrange multipliers. The benefit of Lagrange multipliers is that the function can be maximized or minimized without handling the constraints separately. Using the Lagrange multiplier method, Equation (51) is transformed into the quadratic expression shown in Equation (52).
where $\alpha_{i}$ is the Lagrange multiplier. By calculating the derivatives of L and setting the results to zero, the extremum of L is identified. Equations (53) and (54) show the derivatives of L with respect to $\mathbf{w}$ and b, respectively. The result of Equation (53) shows that the vector $\mathbf{w}$ is a linear sum of all or some of the samples.
Equation (55) shows the result of substituting the expression for $\mathbf{w}$ from Equation (53) into L in Equation (52).
L can be expressed as in Equation (56) because $\sum \alpha_{i} y_{i} = 0$.
The objective is to identify the maximum of L, and this maximization depends on the sample vectors $\mathbf{x}$ only through the dot products of pairs of samples.
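For reference, the chain of steps described above corresponds to the standard hard-margin SVM derivation, which can be written compactly as follows. The notation is the common textbook form and is an assumption; it may differ in detail from Equations (43) to (56).

```latex
\begin{aligned}
&\text{hyperplanes:} && \mathbf{w}\cdot\mathbf{x} + b = 0, \qquad \mathbf{w}\cdot\mathbf{x} + b = \pm 1 \\
&\text{constraint:} && y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0 \\
&\text{margin:} && (\mathbf{x}_{+} - \mathbf{x}_{-})\cdot\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}
   = \frac{(1 - b) - (-1 - b)}{\lVert\mathbf{w}\rVert} = \frac{2}{\lVert\mathbf{w}\rVert} \\
&\text{Lagrangian:} && L = \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
   - \sum_i \alpha_i \bigl[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \bigr] \\
&\text{stationarity:} && \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0,
   \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \\
&\text{dual:} && L = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i\cdot\mathbf{x}_j
\end{aligned}
```

The last line follows by substituting $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ into the Lagrangian and using $\sum_i \alpha_i y_i = 0$, which makes explicit that only the dot products $\mathbf{x}_i\cdot\mathbf{x}_j$ enter the optimization.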

Decision Tree
A decision tree is a supervised machine-learning model with an intuitive process and high execution efficiency. It is suitable for both classification and regression. Compared with other machine-learning models, its execution speed is a major advantage.
In addition, each decision stage of a decision tree is fairly clear (YES or NO). In contrast, logistic regression and SVMs resemble black boxes: it is difficult to predict or understand their internal complexity and operational details. A decision tree also lets us trace and draw the decision-making process from the root, through each node, to the final leaf.

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a classification approach that assumes the data of different classes are generated from different Gaussian distributions. LDA attempts to find a linear combination of features that characterizes or separates two classes of objects or events. The resulting combination can be used as a linear classifier or, more commonly, for dimensionality reduction before subsequent classification.

k-Nearest Neighbor Algorithm (K-NN)
The k-Nearest Neighbor algorithm (K-NN) is a statistical method for classification and regression. In both cases, the input consists of the k closest training samples in the feature space. In K-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors: it is assigned to the most common class among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor. In K-NN regression, the output is the property value of the object, computed as the average of the values of its k nearest neighbors.
Therefore, the K-NN algorithm uses the vector space model to classify cases. Cases of the same class are highly similar to each other, so the likely class of an unknown case can be evaluated by calculating its similarity to cases of known class.
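Both uses of K-NN can be sketched in a few lines of Python. The squared Euclidean distance and the toy feature vectors are assumptions for illustration; the distance metric and value of k used in the Classification Learner experiments are not stated in this section.

```python
from collections import Counter

def _nearest(train, query, k):
    """Return the k training items (features, target) closest to the query."""
    return sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]

def knn_classify(train, query, k=3):
    """Majority vote among the labels of the k nearest neighbors."""
    votes = Counter(label for _, label in _nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Average of the target values of the k nearest neighbors."""
    neighbors = _nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

if __name__ == "__main__":
    samples = [((0.0, 0.0), "dandruff"), ((0.0, 1.0), "dandruff"),
               ((5.0, 5.0), "allergy"), ((6.0, 5.0), "allergy"),
               ((5.0, 6.0), "allergy")]
    print(knn_classify(samples, (0.2, 0.1), k=3))
```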

Ensemble Learning
Ensemble learning uses multiple classifiers to make predictions on a data set, thereby improving the generalization capability of the overall classifier. We use the classification problem as an illustration. Classification amounts to finding a function (rule) that assigns a class to each instance. In ensemble learning, several classifiers are trained, and when a new data instance is classified, the results of these classifiers are combined (for example, by voting) to determine the final class. Letting multiple decision makers jointly determine the class of an instance gives better results and better generalization than a single classifier.

Measurements and Experimental Results
In this work, we use a 200× magnification camera to take scalp images as shown in Figure 19.
The scalp examples are sorted into four groups: bacteria type 1, bacteria type 2, allergy, and dandruff groups as shown in Figure 20a-d, respectively.

Experimental Results of Deep Learning
The three learning rates set in this research are $10^{-4}$, $10^{-5}$ and $10^{-6}$. The accuracy is the number of images that the deep-learning model predicts correctly divided by the total number of images. In Figure 21, validation (Val) represents the validation error computed on the validation data set, which is used to estimate the predictive error of the selected model. Similarly, training in Figure 21 represents the training error computed on the training data set. The top-5 error is the proportion of testing images whose correct label is not among the five labels that the model ranks highest. Similarly, the top-1 error is the proportion of testing images whose correct label is not the label that the model predicts. Smaller top-1 and top-5 errors indicate better results. As shown in Table 1, the accuracy increases as the learning rate increases from $10^{-6}$ to $10^{-4}$.
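The top-k error and the accuracy can be computed from per-class scores as in the following sketch. The score rows and labels are made-up values for illustration, not results from Table 1, and the function names are hypothetical.

```python
def top_k_error(scores, labels, k=1):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

def accuracy(scores, labels):
    """Accuracy is the complement of the top-1 error."""
    return 1.0 - top_k_error(scores, labels, k=1)

if __name__ == "__main__":
    scores = [[0.70, 0.10, 0.10, 0.10],
              [0.30, 0.50, 0.15, 0.05],
              [0.10, 0.20, 0.30, 0.40]]
    labels = [0, 0, 3]
    print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```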

Experimental Results of BOW with Machine-Learning Classifiers
Table 2 shows the time spent on obtaining the SIFT features, calculating K-means and generating histograms for the training data, and on obtaining the SIFT features and calculating the histograms for the testing data. No K-means step is listed for the testing data because their histograms are built from the K-means results of the training data. The experimental results show that the time spent obtaining the SIFT features from the training data does not increase as the number of centers chosen for the K-means method increases. However, the time spent on the other operations increases as the number of centers increases.
Table 3 shows the accuracies of five machine-learning methods (decision tree, LDA, SVM, K-NN and ensemble learning) trained on the BOW features for different numbers of K-means centers: 10, 50, 100, 300 and 500. For all five methods, the accuracy increased as the number of centers increased from 10 to 50 to 100. The best accuracy of all scenarios was 80.5% and was achieved with SVM.
Although PHOG is an extension of HOG, the time required to generate the PHOG results is far less than that of HOG. The accuracy of PHOG is also better than that of HOG, both with and without normalization. The accuracy of PHOG without normalization is approximately 39.77%, which is 5% higher than that of HOG. Furthermore, after normalization, the accuracy of PHOG is approximately 44%, which is approximately 7% higher than that of HOG. In addition, the time spent on HOG is approximately 15 times that of PHOG. As seen in Table 5, the greatest accuracy was achieved using SVM based on the PHOG features. Compared with the BOW features, the PHOG features give lower accuracy with the same five machine-learning methods.

Summary
As shown in Table 6, deep learning has the highest accuracy at 89.77%, and PHOG has the lowest at 53.4%. In terms of the time spent on the training and testing data, PHOG is the quickest algorithm and BOW is the slowest, requiring 8232.574 s and 447,958.95 s, respectively.

6. Conclusions and Future Works

Conclusions
In this paper, we analyzed scalp images using machine-learning algorithms, including deep learning, BOW with machine-learning classifiers and HOG/PHOG with machine-learning classifiers. The results show the good performance of deep learning, which achieved an accuracy of 89.77%. This value is far higher than that achieved by the other two combination methods and higher than the 70.2% accuracy of human inspection.

Future Works
In the future, we hope to use machine-learning algorithms in scalp detectors to automatically detect scalp problems. We will continue collecting additional scalp images to help train the deep-learning model and increase the training accuracy. In addition, we will consider evaluating hybrid deep-learning/non-deep-learning methodologies [43,44] in suitable applications to continue optimizing the performance of scalp image recognition. Moreover, other image classification applications will also be trained and evaluated.

Figure 2. A simple structure of a multilayer perceptron (MLP) with a hidden layer.
Appl. Sci. 2018, 8, x
datasets are usually automatically divided into two parts (training and testing datasets). The training dataset is for training and optimizing the models, and the testing dataset is for evaluating the efficiency of the model.

Figure 3. System architecture of the intelligent scalp detection system (ISDS).

Figure 4. Training and testing of the deep-learning structure.

Figure 5. Bag of Words (BOW) with the support vector machine (SVM) method.

Figure 8. Pixel representation and visualization of filters.

Four steps are performed to obtain the SIFT features: (a) scale-space peak selection; (b) key point localization; (c) orientation assignment; and (d) key point description.
(a) Scale-space Peak Selection
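As a rough illustration of the scale space in which peaks are searched, the sketch below lists the Gaussian standard deviations for four octaves with five blur levels each, the layout shown in Figure 13. The base value `sigma0 = 1.6` and the step factor `k = sqrt(2)` are assumed values commonly used in SIFT descriptions; they are not stated in this paper, and the function name is hypothetical.

```python
import math


def scale_space_sigmas(num_octaves=4, scales_per_octave=5, sigma0=1.6, k=math.sqrt(2)):
    """Return one list of Gaussian standard deviations per octave.

    Within an octave, each level is blurred with k times the previous sigma.
    Each new octave starts at twice the base sigma of the previous octave,
    which corresponds to halving the image resolution.
    sigma0 and k are assumed defaults, not values taken from the paper.
    """
    return [
        [sigma0 * (2 ** octave) * (k ** s) for s in range(scales_per_octave)]
        for octave in range(num_octaves)
    ]


sigmas = scale_space_sigmas()
for octave, row in enumerate(sigmas):
    print(octave, [round(s, 2) for s in row])
```

In SIFT, candidate key points are then taken as local extrema of the differences between adjacent blur levels within each octave.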

Figure 13. Dandruff image displayed at four octaves with five standard deviations.

Figure 14. (a) Before applying the threshold method; (b) after applying the threshold method.

Figure 15. (a) Gradient magnitude and orientation of a key point; (b) gradient magnitude and orientation of all key points found.

Figure 18. (a) Margin between two support hyperplanes; (b) the margin is the projection of the difference vector onto the unit normal.

the support hyperplanes. The margin is the projection of the difference vector $\vec{x}_{+} - \vec{x}_{-}$ onto the unit normal $\vec{w}/\lVert\vec{w}\rVert$, which is normal to the optimum separating hyperplane, as shown in Figure 18b.
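The projection described above can be written out explicitly. The short derivation below assumes the usual canonical scaling, in which the two support hyperplanes are $\vec{w}\cdot\vec{x} + b = \pm 1$ and $\vec{x}_{+}$, $\vec{x}_{-}$ lie on them; this scaling is a standard convention assumed here for illustration.

```latex
\begin{aligned}
\vec{w}\cdot\vec{x}_{+} + b &= +1, \qquad \vec{w}\cdot\vec{x}_{-} + b = -1 \\
\Rightarrow\quad \vec{w}\cdot(\vec{x}_{+}-\vec{x}_{-}) &= 2 \\
\text{margin} &= (\vec{x}_{+}-\vec{x}_{-})\cdot\frac{\vec{w}}{\lVert\vec{w}\rVert}
             = \frac{2}{\lVert\vec{w}\rVert}
\end{aligned}
```

Under this assumption, maximizing the margin is equivalent to minimizing $\lVert\vec{w}\rVert$.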

Table 1. Time for deep learning of the training and testing data (in seconds).

Table 2. Time for BOW with the training and testing data (in seconds).

Table 3. Accuracy of BOW based on different machine-learning classifiers.

Experimental Results of PHOG/HOG with SVM
Compared with the deep learning and BOW methods, the accuracies of PHOG and HOG are far lower (less than 45%), as shown in Table 4.

Table 4. Training time, testing time and accuracy for PHOG and HOG.

Table 5. Accuracy of PHOG based on different machine-learning classifiers.

Table 6. Running time and accuracy comparisons among the four methods.