Improved Transfer-Learning-Based Facial Recognition Framework to Detect Autistic Children at an Early Stage

Autism spectrum disorder (ASD) is a complex neuro-developmental disorder that affects social skills, language, speech and communication. Early detection of ASD individuals, especially children, could help to devise and strategize right therapeutic plan at right time. Human faces encode important markers that can be used to identify ASD by analyzing facial features, eye contact, and so on. In this work, an improved transfer-learning-based autism face recognition framework is proposed to identify kids with ASD in the early stages more precisely. Therefore, we have collected face images of children with ASD from the Kaggle data repository, and various machine learning and deep learning classifiers and other transfer-learning-based pre-trained models were applied. We observed that our improved MobileNet-V1 model demonstrates the best accuracy of 90.67% and the lowest 9.33% value of both fall-out and miss rate compared to the other classifiers and pre-trained models. Furthermore, this classifier is used to identify different ASD groups investigating only autism image data using k-means clustering technique. Thus, the improved MobileNet-V1 model showed the highest accuracy (92.10%) for k = 2 autism sub-types. We hope this model will be useful for physicians to detect autistic children more explicitly at the early stage.


Introduction
Autism is referred to as a complex developmental disability that affects the brain's ability to process information. It is a neurological disorder characterized by weak social interaction, lack of eye contact and restricted, repetitive and stereotyped behaviors [1]. Although no treatment policy has been invented to recover from autism to date, it can be controlled through early intervention and continued therapies [2]. Therefore, every autistic individual needs to receive proper care, support and attention, which is a very important result of its early detection. This diagnosis may help to recover these symptoms and prevent the distress of the child's development and other psychological illness. Nevertheless, it is very difficult to detect and diagnose ASD given its complex nature. Numerous studies have been conducted to explore significant features of autism in various ways such as feature extraction [3], eye tracking [4], facial recognition [5][6][7], medical image analysis [8], apps development [9], voice recognition [10] and so on. Among them, facial recognition plays an important role in recognizing a person's identity or emotional state and can be used to detect autism effectively. It is a popular way to analyze human faces and extract the distinct features between normal and abnormal faces, including mining relevant information to reveal behavior patterns [11,12].
Given advances in predictive analytics of facial pattern recognition, however, some in-depth efforts are currently underway in the space of autistic children's data analyses to to detect ASD earlier. To automate detection of facial expressions of various neurological disorders, Yolcu et al. [13] proposed a novel convolution neural network (CNN) approach that integrated part-based and holistic information, where the first CNN was trained to segment important facial components and the second one was used to recognize facial expression. Further, they improved their system with four CNN models [7], where the first three CNNs were used to partition the facial components, i.e., eyebrow-segmented, eye-segmented and mouth-segmented images, and the fourth CNN was created for the final iconized images to recognize facial expressions of various neurological disorders. In 2018, Haque and Valles [5] modified Kaggle's Facial Expression Recognition 2013 (FER2013) dataset of young autistic children with lighting effects (darker or lighter shades of contrast) and recognized human facial expression using a deep CNN. Rudovic et al. [6] proposed a novel deep learning model called CultureNet that investigated video data of 30 children of different cultural a backgrounds in the context of automated engagement estimation using the child-independent and -dependent settings. Joseph et al. [14] represented a real-time emotion recognition method that predicted the emotion of autistic children following Robot-Assisted Therapy (RAT) using Raspberry Pi 3 and deep-learning techniques.
In this study, we propose a transfer-learning-based face recognition framework to detect autistic children more precisely. To do so, we collected a range of facial images of normal and autistic children, and different machine learning, deep learning and other improved pre-trained models were used to analyze them. Then, the performance of these classifiers was evaluated with various measures, i.e., accuracy, AUC, f-measure, g-mean, sensitivity, specificity, fall-out and miss rate. Then, we improved MobileNet-V1, which shows the highest accuracy in this predictive analysis. In summary, our study is fundamentally an image classification task, where a well-trained classification mode (transfer-learning based) can detect autism given an input image of a child. With the advent of high-spec mobile devices, this type of model can readily provide a test diagnostics of putative autistic traits by taking an image with cameras. The contribution of this work is summarized in the following:

•
We propose an improved transfer-learning-based facial recognition framework that has the potential of yielding high accuracy to identify autistic children. • We explore an improved MobileNet model that shows the best performance among various standard machine learning, deep learning and related pre-trained models. • We focus on a less explored area of image processing and recognition, i.e., identifying ASD-affected children from facial images of normal and autistic children. • We implement and validate a range of different machine learning and deep learning models. • We identify relevant clusters of autistic children.
This paper is organized as follows: Section 2 describes an improved transfer-learningbased autism facial recognition framework that provides insight on how the normal and autistic children's images are investigated. Then, it outlines the primary facial image dataset, proposed framework along with baseline classifiers, improved MobileNet-V1 and K-means clustering algorithms. Section 3 provides the experimental findings of different classifiers including improved MobileNet-V1, and Section 4 reports the performance validation of the proposed framework and their implications. Finally, Section 5 concludes and offers future research directions.

Materials and Methods
In this study, we proposed a transfer learning-based autism facial recognition framework to investigate ASD at an early stage. Therefore, we employed machine learning and deep learning and their pre-trained models that could automatically do robust feature extraction to an extent that is of near-impossible to detect by mere observation due to their subtlety, and we performed classification with them. Moreover, computational analysis of facial scanning features on autism and control faces was indicated as the detector. Hence, improved transfer-learning-based autism face recognition framework and related machine learning methods are described in this section.

Dataset
We collected 2936 facial images of normal and autistic children from the Kaggle data repository, where 1468 normal and 1468 autistic children images were found. This dataset was originally curated by Piosenka [15], where most of the images are downloaded from websites and Facebook pages related to autism. Note that due to the websites of images, many of them are not of the best quality or consistency with respect to the facial alignment, perspective or image size. Next, a python script was developed that automatically crops images such that the input dimension of facial images was 224 × 224 × 3. In this work, collected the dataset was categorized into three groups, i.e., training, validation and test sets. The training set was used for model training and contained a total of 2536 (86.38%) images, where the numbers of images of autistic and non-autistic children were 1268 each. In addition, the validation set is 3.41%, of which the numbers of autistic and non-autistic children images were 50 each. However, the test set was kept separated with 300 (10.22%) images, where the number of images of autistic and non-autistic children were 150 each. Thus, the primary facial dataset was observed and checked, and no major biases/obstacles (to the best of our knowledge and efforts) for further analysis were found.

Improved Transfer-Learning-Based Autism Facial Recognition Framework
The objective of this framework was to identify autism cases by analyzing facial patterns of children employing various machine learning methods. Figure 1 illustrates the schematic workflow diagram of this framework, which is described briefly as follows.
• Data Acquisition: Firstly, the primary facial image dataset was gathered and cleaned. It contains training, validation and test set for further analysis. • Training Stage: Then, the facial training dataset was used to train CNN, pre-trained improved CNN models and other machine learning classifiers. In this circumstances, the classifiers were called baseline classifiers. • Performance of Validation and Test Set: In this training process, various classifiers were employed in the validation set to assess the performance of training activities. Consequently, baseline classifiers were also implemented into the test set and to evaluate of the performance of classifiers. Notably, we considered the distribution of face scanning coordinates as a discriminating features for autism classification. • Evaluation Metrics: In this evaluation process, different metrics such as accuracy, area under curve (AUC), f-measure, g-mean, sensitivity, specificity, fall-out and miss-rate were used to verify the result of various classifiers. Clustering-based Autism Sub-typing: According to previous studies [16][17][18], several sub-types were generated from patients with autism for further medical and neurological analysis. The motivation for conducting clustering on the autistic faces only was to explore different autism sub-types based on quantitative analysis to help to extract and investigate distinguishable significant features of autism. This approach reduces data dimensionality and wraps strong and significant features that can appear in multiple clusters [19]. Therefore, this procedure can lessen false-positive and false-negative results of autism for further supervised learning tasks. To generate the number of clusters/groups from the working dataset, k-means algorithm (see details in Section 2.5) was widely employed in the different fields of machine learning [16,[20][21][22]. In this work, this algorithm was implemented into only autistic facial images (e.g., all autism images were gathered from training, validation and testing sets) and generated numerous autism sub-types in each iteration by changing the values of k from 2 to 10. Then, those sub-types were considered as individual class labels to further demonstrate the predictability of our selected improved MobileNet-V1 (i.e., the best performing model in classifying autistic/normal faces in our first experiment). Note that the problem then became a multi-class classification problem (i.e., it depends on the value of k, e.g., k = 3 would be a 3-class classification and so on), rather than a simple binary-class classification problem (i.e., autistic/normal faces). Note, this multi-class classification task only uses the autistic faces (i.e., no controls were used) for training, validating and testing the best performing improved MobileNet-V1. In addition, this classification was followed 10-fold cross validation and best sub-types has been selected based on accuracy of the classifier.

(C) Identification of Autism Clusters (A) Data Preprocessing Step (B) Evaluation of Baseline Classifiers
Autistic Children Images  To realize the overall structure of proposed framework, it is required to know details about its working methods. These settings are helpful to understand how this framework can detect autism more precisely and generate highest results by the improved model. Therefore, we outline associated machine learning methods of this work in detail as follows.

Improved MobileNet-V1
Improved MobileNet-V1 model is the enhanced version of traditional MobileNet-V1 (see Appendix A.1) with some additional layers which illustrates in details at Figure 2. In this MobileNet-V1, we have augmented several additional layers to increase the performance of this model. The batch normalization layer is used to normalize the output of global average pooling layer by re-centering and re-scaling input values. Like other transfer learning models, it also append three batch normalization (BN), two fully connected (FC) layers one after another before output layer (see Figure 2). Regarding the dimensions, the primary facial images had their input dimensions as 224 × 224 × 3 each, with the input filter dimension as 3 × 3 × 3 × 32 (for detail, please see in Andrew et al. [39]). When we apply the improved MobileNet-V1, the dimensions of input images are reduced according to the regular conversion of MobileNet-V1 depending on the "depthwise separable convolution operation" (see Figure 2). Finally, it generates one-dimensional output for the given input image, depending on the number of classes to be predicted (i.e., binary classification or multi-class classification).

K-Means Clustering
K-Means Clustering is an unsupervised learning method that classifies an unlabeled dataset into various clusters. It is a centroid-based technique, where each cluster is connected with centroid values. Therefore, the sum of the distance between the data points and their corresponding clusters is minimized using this algorithm. At first, the number of clusters and centroids are specified in this algorithm. Then, each data point is assigned to the closest centroids. Repeat this process until all data points are assigned into different clusters.

Experimental Result
To evaluate the performance of baseline classifiers in the proposed framework, we implemented several machine learning classifiers named AdaBoost, DT, GB, KNN, LR, MLP, NB, RF, SVM, XGB using sci-kit learn, CNN and improved pre-trained models of CNNs such as DenseNet121, ResNet50, VGG16, VGG19, MobileNet-V1 and MobileNet-V2 using Keras library in python [23,41,42], respectively. Therefore, all computations were manipulated at Google Colaboratory [43]. The training set was used to train individual classification models as well as the validation set along with the test set in order to evaluate classifiers' performances. Several evaluation metrics like accuracy, AUC, f-measure, g-mean, sensitivity, specificity, fall-out, and miss rate were calculated to measure the performance of these classifiers (see details of metrics in Appendix A.2). The details of experimental results of validation and test set are provided in Tables 1 and 2, respectively.

Evaluation the Result of Validation Set
In the validation set, the improved MobileNet-V1 model provides the highest accuracy, AUC, f-measure, g-mean, sensitivity and specificity as shown in Table 1. It displays 83% accuracy, AUC, f-measure, g-mean, sensitivity and specificity and 17% fall-out and miss rate. In this case, ResNet50 shows the second-highest (80%) accuracy, AUC, f-measure, g-mean, sensitivity and specificity, whereas its fall-out and miss rate are 20%. Other classifiers produce less than 80% and greater than 60% results except fall-out and miss rate in all evaluation metrics separately. Hence, they show that their fall-out and miss rates are below or equal to 40% and above or equal to 20%, respectively. In Figure 3b, ROC curves of individual classifiers for validation set are represented. In these curves, the improved MobileNet-V1 model outperforms than other classifiers. Altogether, this model can be suggested to classify autism and normal patients by investigating their faces more efficiently.   Tables 1 and 2 show the results of individual classifiers for the test set. Again, improved MobileNet-V1 generates the best result (90.67%) for all evaluation metrics. In addition, this model provides the lowest fall-out and miss rates and outperforms other classifiers. Among other models, Dense-Net121 demonstrates the second best prediction, which provides 83.67% accuracy, AUC, f-measure, g-mean, sensitivity and specificity as well as fall-out and miss rate of 16.33%. Then, ResNet50 shows almost 80% accuracy, AUC, f-measure, g-mean, sensitivity and specificity and presents below 20% fallout and miss rate. Other classifiers do not show better performance, where they represent below 80% and upon 65% results along with error rate between 23% to 34%. In Figure 3a, ROC curves of individual classifiers for test set are also represented where the improved MobileNet-V1 model shows the highest result in this analysis.

Evaluation the Result of Test Set
Therefore, improved MobileNet-V1 shows the best outcomes among all general classifiers, CNN and also improved pre-trained models. In addition, it can be justified based on the accuracy of individual base (traditional) pre-trained models for ImageNet [44] (see results in Table 3). In this case, we obtain the top one and top five accuracy of pre-trained models for imageNet and the accuracy of improved pre-trained models for the validation and test set of the facial autism dataset. DenseNet121 shows the best accuracy for ImageNet, but it does not show better results for the autism dataset. On the other hand, the improved MobileNet-V1 [39] shows the maximum outcomes for the autism dataset. ImageNet is an image database organized according to WordNet hirarchy that contains 1,281,167 images for training and 50,000 images for validation, organized into 1000 classes [44]. However, our autism facial dataset is quite small in size compared to the ImageNet database. Hence, better-performing transfer learning models for large scale datasets do not show the performance in this work. Instead, the improved MobileNet-V1 model is able to investigate a small number of data more precisely than other pre-trained models. It also provides more suitable results for improved Mobile-V2 in this work. Therefore, we compared the results of the improved MobileNet-V1 with its base (traditional) MobileNet-V1 for validation and test sets. Again, the improved MobileNet-V1 outperforms its base version for both of the datasets (see Table 4).
However, the reason we did the clustering on the autistic faces only, was to identify different autism sub-types depending of the values of k in the k-means algorithm. Then, what follows is that those sub-types were considered as individual "class labels" to further demonstrate the predictability (train and test) of our selected improved MobileNet-V1 (i.e., the best performing model in classifying autistic/normal faces in our first experiment). In that light, the accuracy that we refer to (for example, the best accuracy of 92.10% for k = 2), was from the multi-class classification tasks that we formulated on the "improved MobileNet-V1" (i.e., applying 10-fold cross validation), using each of the clusters (i.e., autism sub-types) as an individual class label. Note that this multi-class classification task only uses the autism faces (i.e., no controls were used) for training, validating and testing the best-performing "improved MobileNet-V1". Figure 4 shows the accuracy of multi-class classification applying improved MobileNet-V1 when it scrutinized only autism sub-types for k = 2, 3, ..., 10.

Discussion
Previously, some researchers worked with various facial images to recognize discrete characteristics using machine and deep learning. Yolcu et al. [7,13] inspected Radboud Faces Database (RaFD) amalgamating 4-channels using CNN cascade and gained 94.44% and 93.43% accuracy, respectively. It has been useful to monitor and diagnosevarious types of neurological disorders influencing facial expressions. However, RaFD is a set of pictures of controls, and how they analyze neurological disorder from the facial images of controls cannot properly be explained. In addition, autism is a special kind of neurological disorder that cannot be detected by applying the general estimating process of other disorders. In addition, Haque and Valles [5] enforced deep CNN into FER2013 dataset from Kaggle for recognizing facial emotion and found their accuracy 63.11%. Again, Jain et al. [45] utilized the FER2013 dataset from Kaggle using the suggested CNN-RNN+ReLU algorithm and obtained 94.46% accuracy. Joseph et al. [14] implemented SqueezeNet algorithm into Cohn-Kanade and Japanese Female Facial Expression dataset and obtained 75% accuracy. They tried to recognize emotions using the social robot and predict the behavior of ASD children. Nevertheless, all of these works had used control facial images, which were not appropriate to detect autism. Moreover, early detection of autistic and non-autistic children are not normally performed using their emotional activities, as far as our current study is concerned.
In order to perceive ASD, many research works have been conducted so far, but there is still some rooms for identifying and improving this condition effectively [46]. In this work, we proposed a framework that investigates facial images (e.g., autism/normal images) of children employing various machine learning, deep learning and pre-trained transfer learning models. Improved MobileNet-V1 reveals the best predictive performance for training, validation and testing images in all evaluation metrics. After that, numerous autism sub-types were extracted from k-means clustering, and therefore, multi-class classification was performed using each of the clusters (i.e., autism sub-types) as an individual class label. Generally speaking, it is an extremely difficult task to differentiate the true autistic child face from an intentional funny face for any classification model at hand. However, while training, the Convolutional Neural Network (ConvNet/CNN) models with substantially "DEEP"-architecture (i.e., large number of hidden layers) are able to scan the images to extract subtle features from a facial image grid, with a series of convolution-pooling-dropout-batch normalization (optional) operations before finally flattening them with a fully-connected layer followed by the output (prediction) layer. In this process, a deep ConvNet can learn dominant autistic traits (image features), some of which are difficult (or impossible) to be mimicked even with intentional funny poses, including wide-set eyes, the groove below the nose, above the top lip, wide forehead and smaller mid-face [47]-as a single feature in combination. More specifically, we assume that these types of extracted features and their likelihoods of being present in combination within the false positives are much less likely. However, we can not completely rule out the selection bias of images that are used for training at the first place, but deep learning models are generally robust enough to address this issue if a huge number of data (big data) are used for training. Therefore, we consider augmenting further high-quality data sets in our future studies to potentially improve our models generalization capabilities. However, the MobileNet is a CNN architecture model for image classification and mobile vision where this structure is perfectly fit with restricted resources like Mobile devices, embedded systems and computers without GPU or low computational efficiency, with web browsers having limitation over computation, graphic processing and storage. In addition, it is much faster than regular convolution with approximately same result. Therefore, improved MobileNet-V1 is provided more accurate performance than other methods.
We argue that, like other successful applications of transfer-learning models, our study can also facilitate highly accurate prediction of autistic traits from images of children, which has been manifested in our results (i.e., the improved MobileNet-V1 model showed the highest accuracy (92.10%) for k = 2 autism sub-types). Therefore, the multi-class classification task only uses the autistic faces (i.e., no controls were used) for training, validating and testing the best-performing improved MobileNet-V1. However, we did the clustering on the autistic faces only, which was to identify different autism sub-types (i.e., depending of the values of k in k-means algorithm).
However, this study is not fundamentally designed for emotional/behavioral face detection; it detects autism/non-autism from static images. Moreover, we have employed deep learning models that can automatically do robust feature extraction, to an extent that is near-impossible to detect by mere observation due to their subtlety, and performed classification with them. We focused on categorizing autism by investigating static images; hence, video sequence analysis was not included in the experiment.
It is very important to detect autism and ensure taking proper steps at an early stage, for which our proposed framework can play an important role in the medical sector to estimate such types of neurological disorder more quickly than existing systems. Moreover, transfer learning models are more compatible with handheld devices such as smartphones, tablets, etc. Hence, they can be easily incorporated in a mobile app, which will give more assistance to for health workers or physicians to identify autism at an autism resource center. In addition, parents can use this app to recognize autism quickly and more precisely at home. In this process, we could take facial images of children by a smartphone camera, manipulate and identify these cases more quickly than any other procedure. Again, this model is very simple, cost-effective and needs low resources to implement into devices.

Conclusions and Future Works
In this work, we proposed a well-assembled transfer learning based autism face recognition framework, where the improved MobileNet-V1 model shows the best result (83% accuracy for the validation set and 91% accuracy for the test set) in a range of stateof-the-art machine learning and deep learning models to recognize control and autistic children of heterogeneous sources more accurately. Later, k-means clustering method has been applied to autism faces to fabricate various sub-types (i.e., depending of the values of k in k-means algorithm) and used improved MobileNet-V1 to predict high accuracy (92.10%) for binary sub-types (i.e., k = 2) in this system. The proposed framework can play a significant role in early autism detection and can be as a useful tool for physicians and health-workers. It can also be used to detect autism without or with less training in the domestic environment. Some limitations are noted in the proposed framework; for instance, a few facial images has been used and their quality is not promising. However, we cannot associate this work with activity recognition/video sequencing/3D images analysis to formulate more auspicious results. In addition, improved MobileNet-V1 has not provided more stable predictive performance for greater number of autistic sub-types. In the future, we will diminish these shortcomings by ameliorating this model with more standard facial images and mingled dynamic recognition techniques (e.g., activity/motion picture/video sequence identification) to perceive autism more precisely. Moreover, we will further develop our "improved MobileNet-V1" to have more stable predictive performances for a greater number of autistic sub-types (i.e., larger values of K in the K-means algorithm); i.e., we will boost multi-class classification tasks to obtain more accurate predictive performance to define autism sub-types.

Data Availability Statement:
The data used in this paper is available in the references in Section 2.1.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Appendix A.1. Traditional MobileNet-V1
Traditional MobileNet-V1 contains a core layer to perform depth-wise separable convolutions that is used to reduce model size and complexity, which is followed by a point-wise convolution. A standard convolution layer takes input of D F × D F × M with feature map F and produces D G × D G × N with feature map G. In this case, it is parameterized by D K × D K × M × N. For standard convolution including stride and padding, the output feature map and its computational cost Q can be calculated as: where D F , D K and D G are the spatial length (width and height) of a square input feature map, kernel and output feature map, respectively. In addition, M is the number of input channels, N is the number of output channels, D K × D K is the kernel size and D F × D F is the feature map size. In depth-wise separable convolution, it uses one filter for each input channel, which can be defined as: Here,K is the depth-wise convolution kernel with D K × D K × M size. Therefore, mth filters inK are implemented and generated mth channel in F andG feature map, respectively. Furthermore, point-wise convolution is represented by 1 × 1 convolution, which reshapes this dimension. Hence, the whole operational cost P of depth-wise separable convolution is defined by: MobileNet makes the usage of 3 × 3 depth-wise separable convolution, which is 8 to 9 times faster than a standard convolution. The reduction cost is calculated by: The MobileNet topology is shown in Figure A1 where the batch norm (BN) and rectified linear unit (ReLU) are connected sequentially after each convolution layer except full connected linear layer. Therefore, this structure uses two model-shrinking hyper-parameters, namely width multiplier, α and resolution multiplier, ρ. The role of α is to thin the network uniformly at each layer. The cost of the depth-separable convolution with α is calculated by:

× ConV
where α ∈ (0, 1] has typical settings of 1, 0.75, 0.50 and 0.25, respectively. Here, α = 1 is used for the baseline MobileNet and α < 1 is used for reduced MobileNets. The second hyper-parameter is called resolution multiplier, and ρ mitigates resources of the neural network. Hence, the computational cost of our network with α and ρ can be represented as: where ρ ∈ (0, 1], which is usually set implicitly so that the input resolution of the network is 224, 192, 160 or 128. Here, ρ = 1 is used for the baseline MobileNet and ρ < 1 for the cost optimized MobileNets. Using multipliers reduces tremendously the multi-adds and parameters, but the accuracy is slightly less in the traditional MobileNet-V1.

Appendix A.2. Performance Evaluation
In this work, we observed and justified corresponding results considering some evaluation metrics such as accuracy, area under the curve (AUC), f-measure, g-mean, sensitivity, specificity, fall-out and miss rate. These metrics are calculated using true positive (TP), true negative (TN), false positive (FP) and false negative (FN), which are defined as follows: