Emotion Detection Using Facial Expression Involving Occlusions and Tilt

Abstract: Facial emotion recognition (FER) is an important and developing research topic in the field of pattern recognition. Facial emotion analysis is gaining popularity in surveillance footage, expression analysis, activity recognition, home automation, computer games, stress treatment, patient observation, depression, psychoanalysis, and robotics. Robot interfaces, emotion-aware smart agent systems, and efficient human-computer interaction all benefit greatly from facial expression recognition, which has garnered attention as a key prospect in recent years. However, emotion recognition still needs improvement because of its shortcomings in the presence of occlusions, fluctuations in lighting, and changes in physical appearance. This paper proposes a new convolutional neural network (CNN) architecture for the FER system, containing five convolution layers, one fully connected layer with a rectified linear unit activation function, and a SoftMax layer. Additionally, feature map enhancement is applied to achieve a higher detection rate and higher precision. Lastly, an application is developed that mitigates the effects of the aforementioned problems and can identify the basic expressions of human emotions, such as joy, grief, surprise, fear, contempt, and anger. Results indicate that the proposed CNN achieves 92.66% accuracy with mixed datasets, while the accuracy for the cross dataset is 94.94%.


Introduction
Recent years have witnessed rapid development in robotics, and its role in society is gradually increasing. This has elevated the importance of emotion detection, as future robots are foreseen as talking with human-like emotions. Similarly, the growing participation of mute persons in society has increased the demand for precise emotion detection, and several approaches have been put forward. To identify human emotions, different classifications have been used [1]. The study asserts that there are six basic, or universal, emotions: delight, grief, fear, surprise, contempt, and anger. These emotions are experienced across all human cultures. They can always be placed in one of two main categories: positive or negative. Further emotions, such as embarrassment, excitement, shame, pride, satisfaction, and amusement, were added and discussed later [2].
Researchers in the past decade have agreed that expression can be predicted by observing the movement, shape, and position of a person's eyes, eyebrows, and mouth. Other challenges come to light when researchers want to build a system that distinguishes emotion [3]. Many challenges arise when detecting emotions in images or videos; the most common is occlusion. It happens when facial features are hidden behind some object, such as a hand covering the face, glasses hiding the eyes, or a microphone hiding the lips. The second most common issue is illumination, the variation caused by the position of the light source; a change in luminosity can cause variations that are significantly larger than the actual differences between faces. This can cause misclassification of the image if the evaluation is based on comparison. The position of the face is also a challenge because different emotions may be detected at different positions. The system can only detect expressions at tilts of 30° to 35°; it is hard to detect emotion from other angles. To detect emotions, both eyes and the mouth should be visible and the face should be in a frontal position. An upward or downward tilt makes emotion detection harder. If the background has the same color as the skin, it is difficult to differentiate between the face and the background. People also differ in skin color and in the shapes of their eyes, noses, lips, and jawlines. Such variations are called interclass variations, and they make it hard to detect the face and the expression in an image.
To identify the feeling of a person using a computer, three methods are used: computer vision, machine learning, and signal processing. The facial action coding system (FACS) [4], offered by Paul Ekman [1], has been employed by most researchers to predict depression, anxiety, and stress levels. There are two main approaches to expression analysis. In the first approach, the full frontal face image must be selected before categorization can be performed. The second approach divides the face image into sub-segments and then processes those sub-segments. The techniques required to detect the expression of a person broadly follow a three-step process: face tracking, feature extraction, and classification. Face detection, the process of locating a face in a frame, is viewed as the preprocessing step in emotion detection [5,6]. The ability of computers to recognize human action is one of the most important applications of computer vision. It can be used for a variety of purposes, such as monitoring children and the elderly, creating sophisticated surveillance systems, and facilitating human-computer interaction. Feature extraction comes next, after the face has been detected. It gathers the main feature points of the face, which serve as a representation of its features. The main goal of feature extraction is to convert the important aspects of the data into numerical characteristics that can then be used in the machine-learning process. The final phase is classifying images into informational categories. Classification uses a decision rule to partition the space of spectral or spatial features into different classes.
There are four main techniques to detect the face in a single image: knowledge-, feature-, template-, and appearance-based methods. However, some hybrid techniques are also used for emotion detection. The knowledge-based method is a top-down approach. In this method, the face is located with the help of human-coded rules, such as features of the face, skin color, and template matching. These basic rules are very easy to implement; for example, a face has two eyes that are symmetric to each other, a nose, and a mouth [6]. Skin color is useful because it does not change with a change in position or occlusion. However, skin color varies from person to person and between regions. The main problem with this method is converting human-knowledge-based rules into code. If the rules are too strict, the face will not be detected; if the rules are too general, the rate of false detection will increase. The other problem with this approach is that it cannot detect a face in different positions or poses [7]. The feature-based method is a bottom-up approach [6] that looks for basic facial features to locate faces in various poses, viewpoints, or lighting conditions. It is designed for face localization. The feature-based method is subdivided into four kinds: facial feature, texture feature, skin color feature, and multiple feature-based methods. The problems with this method are illumination, noise, and occlusion, which corrupt the features so that feature edges are either hard to detect or detected in excess, making the algorithm inoperable [6].
Although template-based approaches are simple to use, they do not capture the overall facial structure. To distinguish between a group of five emotion expressions (entertainment, rage, contempt, fright, and sorrow) in movies from the BioVid Emo database, the face in videos is detected, and spatial and temporal characteristics (points of interest) are extracted [8]. In the appearance-based method, templates are prepared from a number of training images that capture the various forms of facial appearance. In contrast to the template-based method, in which the template is designed by experts, the appearance-based method adopts a learning approach that analyzes the images to make a template. These templates are the models for face detection. Multiple techniques and analyses are performed to find different characteristics of images. These procedures are designed primarily for face detection, which separates face frames from non-face frames [6]. The most popular face detection algorithm now is the Viola-Jones method. The Viola-Jones algorithm comprises four stages: Haar-like feature selection, creation of an integral image, AdaBoost training, and cascading of classifiers. Alternatively, a pre-trained cascade can simply be used to detect objects or faces within an image.
However, with the advancements in technology, it is recommended that the scope of human-computer interaction be widened and that challenges such as occlusions, illumination variations, and changes in physical appearance be taken into account when considering more novel and practical solutions for detecting emotions with good accuracy. Therefore, this paper proposes a new architecture of convolutional neural networks (CNN) for facial emotion recognition systems. In the proposed framework, face detection utilizes the Viola-Jones cascade followed by face cropping and image resizing. The proposed model is based on five convolution layers, one fully connected layer, and a SoftMax layer. Furthermore, feature map enhancement is employed to achieve higher precision and the detection of more emotions. Several experiments are performed to detect anger, disgust, fear, happy, neutral, sad, and surprise. Performance is compared with two test models selected for the experiments.
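The Viola-Jones detector used for face detection in this framework relies on an integral image (summed-area table) to evaluate Haar-like features with a few lookups. The following is a minimal sketch of that general idea in plain Python; it is not the paper's implementation, and the function names and the toy 4 × 4 image are illustrative assumptions.

```python
def integral_image(img):
    """Build a summed-area table with one extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of all pixels above and to the left of (y, x), inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    left_part = rect_sum(ii, top, left, height, half)
    right_part = rect_sum(ii, top, left + half, height, half)
    return left_part - right_part


# Toy image (hypothetical data, not taken from CK+ or JAFFE).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
table = integral_image(image)
center_sum = rect_sum(table, 1, 1, 2, 2)       # 6 + 7 + 10 + 11
feature = two_rect_haar(table, 0, 0, 4, 4)     # left columns minus right columns
```

Once the table is built, the cost of a rectangle sum is independent of the rectangle size, which is what makes evaluating many Haar-like features per window practical.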
The rest of the paper is organized as follows. Section 2 presents the related work, where recent research papers are compared on the basis of attributes such as face detection, preprocessing, feature extraction, classification, database, number of emotions, and accuracy, and the motivation of our research is established. The proposed framework is discussed in detail in Section 3. The implementation of the proposed framework and the results are presented in Section 4. Finally, we present the conclusion and future directions of this research work in Section 5.

Related Work
Prior research placed a strong emphasis on the projection of facial expression, highlighting and identifying the most prevalent emotional traits. However, as time went on, human-computer interaction and artificial intelligence increased the importance of emotion recognition. Researchers suggested employing the local binary pattern histogram and Haar-like features with a cascade classifier to recognize a person's face in real-time video [9], but no significant work was conducted there to determine emotions.
Vertical projection is used to find the limits of the lips before horizontal projection is used to locate the mouth in the identified area of the face. The Viola-Jones algorithm is used for face detection in a variety of settings, including different camera distances, backdrop colors, and object orientations. In [10,11], multi-level systems are proposed that include feature extraction, feature reduction, and principal face detection using the Viola-Jones algorithm. The region of interest (ROI), or feature portion of the image, is determined or removed via feature extraction. Although this stage is the most crucial one, considerable technical detail has been overlooked. The choices made in this stage determine the efficiency of the system [12]. A large number of combinations are used for feature extraction and classification. Feature extraction can be divided into two groups: learned and pre-designed [12]. Pre-designed feature extraction is handcrafted, whereas learned feature extraction is automatic. Pre-designed features are further divided into two main groups: appearance-based features and geometric features. Additionally, a combination of both, called the hybrid technique, is frequently used [13][14][15].
The most common facial feature extraction techniques are local binary pattern (LBP), Gabor features, and principal component analysis (PCA). However, PCA is mostly used for dimensionality reduction. Landmarks and facial points are used for face localization, either alone or combined with Gabor, LBP, or histogram of oriented gradients (HOG) features to extract more accurate features [13]. Classification is the final phase of expression analysis; computational methods are used to improve performance, for instance, to make accurate predictions. The expression can be classified directly or by first recognizing certain action units. The study [14] employs a support vector machine (SVM) in an e-learning system to identify emotions. The achieved accuracy varies from 89% to 100% depending on the dataset used for testing.
To examine classifier performance, test samples are used [15]. During the training phase, the machine-learning algorithm builds a model of the input and creates a hypothesis function for data prediction [15]. One way that machines might recognize facial expressions is by examining the changes in the face when the expressions are shown. The optical flow technique is used to obtain the distortion or vibration vectors caused by facial expressions in the face. The analysis is then performed on the gathered vibration vectors, whose positions and orientations are used for automatic facial expression recognition with a variety of data-mining techniques.
Ref. [14] presents a robust approach for facial expression classification using pyramid HOG and LBP features. Hybrid features are extracted from the patches of the face that undergo major change when the expression changes. Experimental results using SVM indicate a 94.63% expression recognition rate on the CK+ dataset. In [16], SVM-based active learning improves the robustness and accuracy of recognizing female expressions more than male expressions. Surprise and fear, on the other hand, have lower recognition rates.
Recent academic research on emotion recognition typically uses convolutional neural networks (CNN) [17,18]. CNN has proved to be promising for face detection, feature extraction, and classification. It automatically extracts features and classifies them, eliminating the need for handcrafted methods. Convolution layers, the activation function, subsampling, and the dense (fully connected) layer are the four fundamental components of CNN. However, several occlusion-based instances of perplexed face pictures were incorrectly identified by a CNN model based on pre-trained deep learning.
In [19], the authors used a CNN model to obtain features from depth information. The model is based on two layers. The first layer has 6 feature maps with a kernel size of 5, followed by max pooling. The second layer has 6 feature maps with a kernel size of 5 and max pooling of size 2, followed by 12 feature maps and, finally, a Softmax layer. The proposed approach is illumination-invariant and obtains an 87.98% accuracy with 1000 epochs. The authors of [20] present a fusion of two models for emotion recognition. The multi-signal convolutional model (MSCNN) is used to obtain static spatial features, and the part-based hierarchical recurrent neural network (PHRNN) is used to obtain dynamic temporal features; the two are then combined. The PHRNN model is a 12-layer model, whereas the MSCNN model has 6 layers.
The study [21] presents a FER model based on a CNN that has 3 convolutional layers with a 5 × 5 filter size. The authors used a dropout layer for regularization. The proposed model obtains an emotion recognition accuracy of 96% in 3 min. In [22], two convolutional layers are used: the first with a filter size of 5 and the second with a filter size of 7. The max-pooling layer has a 2 × 2 kernel to reduce the size, while the dense layer has 256 hidden neurons. Its learning rate is 0.01, and the training was performed for 2000 epochs. The study obtained promising results, yet ignored occlusions and illumination variations. The paper [23] suggests a CNN architecture that employs the FER2013 database for emotion recognition. The dataset includes 32,298 photos of 90 × 50 pixels. To enhance performance and generalize training, the authors used regularization techniques, with dropout applied after each dense layer and a batch size of 128. Using 40 training epochs, an accuracy of 74% was attained.
Table 1 presents a comparative review of the discussed research works. It describes the process used to detect the face, the preprocessing involved in each approach, the feature extraction approach, the classifiers used for emotion classification, and the reported accuracy. The most common classifiers used for emotion detection are decision trees [13][14][15], SVM [24][25][26][27][28][29], and neural networks [23,30]. SVM is very effective in terms of memory management and dimensionality. On the other hand, its performance suffers on larger datasets, which need a longer training time, and on data with more noise. SVM also does not directly provide probability estimates; these have to be computed separately.
The objective of this review is to view the trends in studies from the past years and see how emotions are detected using facial expressions. It is clear from the research that mainly six to seven basic emotions are detected. Predominantly, the Viola-Jones method is adopted for face detection in a frame, and then landmarks or LBP descriptors are used for feature extraction. PCA is applied for dimensionality reduction, and SVM is used for emotion classification. The average accuracy reported across the 15 reviewed methods is 81.77%. It was observed that the RBF error reduction method is the most efficient. Most of the work was performed in feature extraction, but research is moving more toward CNN, as it is more efficient and does not need hand-crafted methods to improve performance. It detects features automatically but requires more data for training.
For the current study, we use the extended Cohn-Kanade (CK+) [31] and Japanese Female Facial Expression (JAFFE) [32] datasets, which contain large amounts of data and are frequently used. Our main focus is to mitigate the effects that occur in images due to occlusions. We focus on human emotions such as joy, grief, surprise, fear, contempt, anger, and neutral.

Materials and Methods
In this section, the proposed approach is presented. The aim of the proposed approach is to consider challenges such as occlusions, illumination variations, and changes in the physical appearance of mute persons' images and to mitigate their effects. The model is designed to identify the basic expressions of human emotions such as joy, grief, surprise, fear, contempt, anger, and neutral. The proposed CNN has 6 layers: 5 convolutional layers with max pooling and one dense layer with a dropout function. Figure 1 shows the flow of the proposed model, where the obtained image set is preprocessed in the first step, followed by face detection and cropping in the second step. In the third step, the image is flipped vertically, giving 2 images; in the final step, each of them is rotated to 7 angles, producing a total of 14 images. In the proposed framework, the first convolutional layer uses a 5 × 5 filter. It takes a 32 × 32 grayscale image, which means the number of channels is 1, and outputs 32 feature maps. It breaks the image into small subsections of size 5 × 5. Then, to reduce the image data, the max pool function is used, which picks the maximum value in each region, as shown in Figure 2. After max pooling, the spatial size becomes 11 × 11, but the number of feature maps stays the same as in the convolutional layer.
Machine-learning-based models have two phases: training and testing/execution. In the training phase, all the data along with their labels are provided to the classifier so that it learns the pattern between data and labels, while in testing, the trained model is validated. The training phase produces a suitable function, called f(x). Initially, preprocessing is performed on the image, and features are extracted. Then the CNN is applied to find the pattern, and the trained model is saved. In the next phase, the trained model and weights are loaded to predict the labels for the test samples.
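The layer sizes quoted above can be checked with the usual output-size formulas for convolution and pooling. The sketch below is a hedged illustration only: the text does not state the padding mode, pooling window, or stride of the proposed model, so the 3 × 3 window with stride 3 and 'same' padding is an assumption that happens to reproduce the reported 11 × 11 size.

```python
import math


def output_size(size, window, stride, padding="same"):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == "same":
        # The input is padded so that every stride step yields one output.
        return math.ceil(size / stride)
    if padding == "valid":
        # No padding: the window must fit entirely inside the input.
        return (size - window) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


# First convolution: 5 x 5 filter, stride 1, 'same' padding (assumed).
conv_out = output_size(32, window=5, stride=1, padding="same")

# Max pooling: 3 x 3 window, stride 3, 'same' padding (assumed, not from the paper).
pool_out = output_size(conv_out, window=3, stride=3, padding="same")

print(conv_out, pool_out)
```

Under these assumptions the 32 × 32 input stays 32 × 32 after the convolution and shrinks to 11 × 11 after pooling; with 'valid' padding the same pooling window would give 10 × 10 instead.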

Dataset Description
This study uses the publicly available CK+ and JAFFE datasets, which have been widely used in the existing literature. Table 2 shows the number of samples for each dataset. CK+ has 123 subjects who posed eight emotions: anger, contempt, disgust, fear, joy, neutral, sad, and surprise. There is a sequence of images for each emotion, starting with neutral and ending with the extreme expression. The images are manually picked, and the neutral images are separated from the original dataset. Similarly, the remaining images are sorted into their respective folders. This gives 9591 images in total, which include duplicates. These duplicate images are removed, and the size of the set is reduced to 6362 images. JAFFE is based on 10 female subjects and contains 213 images in total; these images are only separated into their respective folders. Details of the total images for each emotion can be seen in Table 3. The first column lists the emotions. The second column has two sub-columns displaying the number of original images per set and the number of images after removing duplicates from CK+. The third column shows the JAFFE details, and the last column presents the total images of each emotion. The total number of images after removing duplicates is 6575.

Preprocessing Dataset
In the preprocessing phase, the image is changed into a format that is appropriate for the CNN model. Preprocessing is divided into four main steps: detecting the face, cropping it, flipping it vertically, and making samples at different angles. OpenCV [33] is used for preprocessing. In the first step, the image is converted into grayscale, which turns the 3-channel RGB image into a 1-channel image. To detect the face, Viola-Jones [34] is used with Haar-like features through the pre-trained frontal face cascade files provided by OpenCV, which return the face area. The face area is cropped and resized to 32 × 32, and a vertically flipped copy of it is made. This step doubles the number of images, which are then rotated to 7 different angles (−45, −30, −15, 0, 15, 30, 45). This generates a large amount of image data and provides more samples to train and test. Moreover, it trains the model on different angles as well. This process is applied to both the training set and the testing set. It makes our model more powerful and precise in detecting emotions from different angles.
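The augmentation described above multiplies every cropped face by 2 (original plus flipped copy) and then by 7 (rotation angles). The following plain-Python sketch enumerates those 14 variants and counts the resulting images; the helper names are hypothetical, and the rotation itself is left out (with OpenCV it could be done with `cv2.getRotationMatrix2D` and `cv2.warpAffine`, although the paper does not say which calls were used).

```python
ANGLES = (-45, -30, -15, 0, 15, 30, 45)  # rotation angles in degrees, from the text


def flip_vertical(img):
    """Flip a row-major image top to bottom by reversing the row order."""
    return [row[:] for row in reversed(img)]


def augmentation_plan(angles=ANGLES):
    """List the (is_flipped, angle) variants generated from one cropped face."""
    return [(flipped, angle) for flipped in (False, True) for angle in angles]


def augmented_count(n_images, angles=ANGLES):
    """Number of images after flipping and rotating every input image."""
    return n_images * len(augmentation_plan(angles))


tiny = [[1, 2], [3, 4]]                 # toy 2 x 2 "image" for illustration
flipped = flip_vertical(tiny)           # [[3, 4], [1, 2]]

combined = augmented_count(6575)        # CK+ and JAFFE together
ck_plus = augmented_count(6362)         # CK+ only
jaffe = augmented_count(213)            # JAFFE only
print(combined, ck_plus, jaffe)
```

The counts agree with the dataset sizes reported in the experiments: 89,068 images from CK+ and 2982 from JAFFE, or 92,050 for the combined set.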
CNN is well-suited for pattern classification problems. It is very similar to an ordinary neural network: neurons, activation functions, weights, and learning rates are the same. The key difference is in its structural design, as CNN takes images as input and is specially designed to deal with 2-dimensional data [35,38]. In every CNN model, it is essential to set some hyperparameters, such as the learning rate, regularization value, filter sizes, size of the feature map, and number of hidden neurons. The performance of the CNN depends on these parameters and on the arrangement of layers. The computation of a convolutional layer is performed by sliding a window, called a filter, over the original image in steps of one pixel; the step size is called the stride. At each position, the filter and the underlying pixels are multiplied element-wise and the products are summed into a single value, which forms one element of the resulting matrix. The output is called a feature map, convolved map, or activation map. The values of a feature map depend on the values of the filter, as different filters generate different feature maps. The parameters only have to be initialized before training. The following are some parameters related to the different convolutional layers.
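To make the sliding-window computation concrete, here is a minimal plain-Python sketch of a single-channel convolution as it is commonly implemented in CNNs (no kernel flipping, no padding). The image and filter values are toy assumptions chosen for illustration; in a trained network the filter values are learned.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Element-wise multiply the window with the filter and sum.
            for u in range(kh):
                for v in range(kw):
                    acc += image[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
diagonal_diff = [[1, 0], [0, -1]]   # responds to change along the diagonal
box_sum = [[1, 1], [1, 1]]          # sums each 2 x 2 neighbourhood

map_a = convolve2d(image, diagonal_diff)
map_b = convolve2d(image, box_sum)
```

The two filters produce different feature maps from the same image, which is the point made in the text. Because this sketch uses no padding, a 4 × 4 input with a 2 × 2 filter yields a 3 × 3 map; frameworks typically offer padding to keep the spatial size unchanged.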
Hyperparameters are the values that must be set before training. In the current study, batch size, learning rate, weights, biases, hidden neurons, input shape, output values at each layer, etc., are hyperparameters. Some of these are crucial, such as the learning rate and the number of hidden neurons. The batch size used is 'None' because dynamic allocation is preferred. For regularization, the dropout function is used only once, after the fully connected layer, with a value of 0.8; the number of hidden neurons is 1024. The learning rate of the proposed model is set to 0.0001, as mentioned in Table 4.
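The settings named in this section can be gathered in one place, for instance as a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical, only values stated in the text are filled in, and the per-layer details from Table 4 are deliberately not reproduced.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in the text (field names are illustrative)."""
    input_shape: Tuple[int, int, int] = (32, 32, 1)  # 32 x 32 grayscale input
    num_conv_layers: int = 5
    first_layer_filters: int = 32                    # feature maps of layer 1
    first_layer_kernel: int = 5                      # 5 x 5 filter
    hidden_neurons: int = 1024                       # fully connected layer
    # The text gives a dropout "value" of 0.8 without saying whether it is
    # a keep probability or a drop rate; it is stored as given.
    dropout_value: float = 0.8
    learning_rate: float = 1e-4
    batch_size: Optional[int] = None                 # 'None': allocated dynamically
    num_classes: int = 7                             # seven emotions are classified


config = TrainingConfig()
print(config)
```

Keeping the object frozen avoids accidental changes between the training and testing phases, which both need the same input shape and layer sizes.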

Evaluation and Analysis
The performance of the proposed model is evaluated in terms of testing and validation. Results are evaluated regarding accuracy, which is calculated based on the values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) given in the confusion matrix.
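As a hedged illustration of how accuracy follows from TP, TN, FP, and FN, the sketch below derives the four counts for one class from a multi-class confusion matrix and applies accuracy = (TP + TN) / (TP + TN + FP + FN). The row/column convention (rows are true classes, columns are predictions) and the 3-class matrix are assumptions made for the example, not data from the paper.

```python
def per_class_counts(cm, k):
    """Return (TP, TN, FP, FN) for class k; cm[i][j] = true i, predicted j."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                    # class k samples predicted as others
    fp = sum(row[k] for row in cm) - tp     # other samples predicted as class k
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correct decisions for one class treated as one-vs-rest."""
    return (tp + tn) / (tp + tn + fp + fn)


def overall_accuracy(cm):
    """Fraction of all samples whose predicted class equals the true class."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


# Toy 3-class confusion matrix (hypothetical numbers).
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
counts = per_class_counts(cm, 0)
class0_acc = accuracy(*counts)
total_acc = overall_accuracy(cm)
```

For a multi-class problem the overall accuracy is simply the sum of the diagonal divided by the number of samples; the per-class form is useful when individual emotions are compared.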

Performance Comparison
Two recently published models are chosen for comparison with the proposed model. The first model, called Test Model 1 [21], consists of 3 convolutional layers. The first two layers have a 32 × 32 feature map, while the third layer has a 64 × 64 feature map. It has two fully connected layers, each with 1000 hidden neurons. After every convolutional layer, max pooling with a 3 × 3 kernel is performed. The dropout rate is 0.5; the learning rate is not mentioned in the paper, so we use our learning rate of 0.0001 to obtain the accuracy. As per the reported results, the model obtains a 96% accuracy. Test Model 2 [22] is based on two convolutional layers of 32 and 64 feature maps, respectively, and one fully connected layer that has 256 hidden neurons. Max pooling is performed after every convolutional layer with a kernel size of 2 × 2. The parameters of Test Model 1 and Test Model 2 are given in Table 5.

Results and Discussions
Two experiments were performed on all test models and the proposed model. The first uses the combined CK+ and JAFFE datasets, with images and subjects unique to the training and testing sets; the second is a cross-dataset experiment in which CK+ is used for training and JAFFE for testing. The training set contains 51,562 images, the validation set contains 22,078, and the testing set contains 18,410 images. This corresponds to 56% of the images for training, 24% for validation, and 20% for testing. To keep testing fair, the data are shuffled and stored in different NumPy arrays. It takes approximately 7 to 8 min to preprocess the 92,050 images and store them in the NumPy arrays.
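The text does not describe how the subject-disjoint split was implemented, so the following is only one possible sketch: subjects are shuffled and assigned whole to the training, validation, or test split until each split reaches roughly 56%, 24%, and 20% of the images. The function name, the seed, and the toy sample list are assumptions.

```python
import random
from collections import defaultdict


def split_by_subject(samples, fractions=(0.56, 0.24, 0.20), seed=0):
    """Split (subject_id, image_id) pairs so that no subject spans two splits."""
    by_subject = defaultdict(list)
    for subject_id, image_id in samples:
        by_subject[subject_id].append(image_id)

    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)   # reproducible shuffle of subjects

    targets = [fraction * len(samples) for fraction in fractions]
    splits = [[], [], []]                   # train, validation, test
    part = 0
    for subject in subjects:
        # Move to the next split once the current one has reached its target.
        while part < 2 and len(splits[part]) >= targets[part]:
            part += 1
        splits[part].extend((subject, image) for image in by_subject[subject])
    return splits


# Toy data: 50 hypothetical subjects with 10 images each.
samples = [(subject, image) for subject in range(50) for image in range(10)]
train, val, test = split_by_subject(samples)
print(len(train), len(val), len(test))
```

Because whole subjects are moved at once, the realized proportions only approximate the targets when subjects contribute different numbers of images, which may explain small deviations from exact percentages.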

Training Phase
In the training phase, 32 × 32 input images are used. All models are trained and tested with the same dataset. First, all the models are trained for 10 epochs to check their behavior, and the results are checked for accuracy. The total number of iterations on the training set is 5100, and the total number of steps is 7970. The proposed model takes a little more time, but its accuracy is greater than that of the other two models. Test Model 1 is the fastest of all but not as accurate as the proposed model. Test Model 2 is neither fast nor accurate. It can be seen clearly in Table 6 that the proposed model has the highest accuracy and the lowest loss. It is therefore decided to train these models further.
The first test is performed after 10 epochs to check each model's accuracy at that point and to confirm that the model is correct and capable of prediction. After 10 epochs, the accuracy rate of Model 1 is 83.26%. The emotion predicted most accurately by Model 1 is 'sad' with 93.45% precision, and the least accurate emotion is fear at a rate of 68.70%. However, the happy, sad, and surprise emotions are predicted with above 90% accuracy, as seen in Table 8. A confusion matrix is then created, as shown in Figure 3, in which the test accuracy per emotion can be clearly seen. Test Model 2 only predicted the fear emotion with 100% accuracy, and all the other emotions were mispredicted. The overall accuracy of the model is 7.91%, as shown in Table 9. Figure 4 shows the number of correct and wrong predictions from Test Model 2 after 10 epochs. It can be seen that all emotions are misclassified except for fear.
The accuracy of the proposed model is 86.80%, as shown in Table 10. The emotion predicted most accurately by the proposed model is 'sad' at a rate of 93.05%, and the least accurate emotion is 'disgust' with a 72.98% accuracy rate. Happy, sad, and surprise emotions are predicted with above 92% accuracy.
Figure 5 shows the confusion matrix of the proposed approach. Results indicate that its performance is better than both Test Model 1 and Test Model 2 on average, as it produces a higher number of correct emotions.
The comparison of all three models can be seen in Table 11. After 10 epochs, the accuracy rate of Model 1 is 83.26%, that of Model 2 is 7.91%, and that of the proposed model is 86.80%, which is the highest of the three. As the first test results were satisfactory, a second test was performed, based on the results after 100 epochs of training. The results of Test Model 1 are given in Table 12. Model 1's accuracy increases after 100 epochs. All the emotions are detected with above 88% accuracy. The most accurately detected emotion is happy with an accuracy of 95.14%, and the least accurately detected one is neutral. The overall accuracy is 91.82%. Test Model 2 is only capable of predicting the 'surprise' emotion, as shown in Table 13. A 100% accuracy is obtained for the surprise emotion, while all other emotions are misclassified. The overall accuracy is 15.44%.
According to the final results of the proposed model, the most accurately recognized emotion is 'sad' at a rate of 95.21%, and the least accurate is the neutral emotion with 86.66%. All the emotions have an accuracy rate higher than 86%, as shown in Table 14. Figure 6 shows the confusion matrices after 100 epochs. The confusion matrix for the proposed model after 100 epochs is presented in Figure 7. It indicates that the number of correct predictions is higher than for Test Model 1 and Test Model 2. As a result, the prediction accuracy is higher as a whole as well as for individual emotions. The proposed model achieves 92.66% accuracy after 100 epochs of training. It can be seen that the proposed method and Test Model 1 predict the happy emotion with the same accuracy of 95.14%. However, the proposed model predicts other emotions, i.e., anger, disgust, fear, and sad, more correctly than Test Model 1, whereas Test Model 1 predicts the neutral emotion more accurately. Test Model 2 only predicts the surprise emotion with 100% accuracy. A comparison of all three models regarding each emotion is given in Table 15. The comparison reveals that Model 1 is faster than the other models, but the proposed model is more precise and gives more accurate results.
The second experiment is based on cross-dataset evaluation. It is executed to check the performance of the proposed model and to evaluate it fairly. It helps to observe and analyze how the models perform outside of the training dataset.

Preparation of Dataset
In this test, we used the CK+ and JAFFE datasets separately. CK+ was split in half, one half for training and the other for validation. CK+ contains 6362 images in total and, after preprocessing, it generates 89,068 images, so each half contains 44,534 images. The JAFFE dataset, which contains 213 images, is used for testing; after preprocessing, it produces 2982 images, as shown in Table 16. For training, we use the same method as in the previous experiment except for the dataset. In the training phase, we feed the CK+ dataset, train the models, and analyze their behavior. Initially, the results are checked after training for 10 epochs. The preliminary observation was the same as in the previous experiment. The proposed model took longer than Test Model 1, but its accuracy is superior to that of the other two models. Test Model 1 is the fastest of all but not as precise as the proposed model. Test Model 2 is neither fast nor accurate compared to the other models.

Testing Phase
In the testing session of experiment 2, we use the JAFFE database as the testing dataset for each model. The first observation is based on models trained for only 10 epochs. To obtain it, we first feed the dataset into the trained models and check the accuracy.
The results of Test Model 1, given in Table 19, show that the overall accuracy of this model is 84.17%. It can identify all emotions; the most accurately identified is 'surprise', with an accuracy rate of 95.48%, and the neutral emotion has the lowest accuracy, at 50.71%. However, the angry, happy, and sad emotions are recognized at rates above 84%.
Table 25 shows the results of the proposed model after 100 epochs. The proposed model detects the 'sad' emotion with 100% accuracy, whereas the other emotions are identified with 94% or higher accuracy, except for the 'neutral' emotion with an accuracy of 84.29%.
According to the observation of test 2, based on the cross dataset with 100 epochs, Test Model 1 distinguishes emotions like disgust, fear, and surprise at a higher rate, whereas the happy emotion is detected at the same rate by Test Model 1 and the proposed model. The proposed model, on the other hand, recognizes anger, neutral, and surprise faces more correctly, as shown in Table 26. The overall accuracy rate of the proposed model is 94.94%. Test Model 1 predicted emotions with accuracies above 76% and up to 97.29%, whereas the proposed model predicted all emotions with accuracies above 84% and up to 100%. The overall accuracy of Test Model 1 is 93.19%, which is less than that of the proposed model.
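The per-emotion and overall accuracy rates reported for each model can be read off a confusion matrix. The following is a minimal sketch, assuming rows hold the true labels and columns the predicted labels; the 3 × 3 matrix is a hypothetical illustration, not data from the experiments, and the function names are our own.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class that is predicted correctly (row-wise)."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(row[i] / total if total else 0.0)
    return rates

def overall_accuracy(confusion):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0

# Hypothetical confusion matrix for three emotions (rows: true, columns: predicted).
labels = ["happy", "neutral", "sad"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]

class_rates = per_class_accuracy(confusion)
overall = overall_accuracy(confusion)
for label, rate in zip(labels, class_rates):
    print(f"{label}: {rate:.2%}")
print(f"overall: {overall:.2%}")
```

In this illustration the class with the most off-diagonal entries in its row has the lowest rate, which is the pattern the text reports for the neutral emotion.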

Discussions
To compare the results, the precision of all models is considered. Precision is calculated with the following equation: Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives. Table 27 compares the models for observation 1 of experiment 1. According to the given results, the average precision of the proposed model (0.8578) is higher than that of the other models (0.8372 for Test Model 1 and 0.0113 for Test Model 2).
For experiment 2, observation 1, the proposed model has an average precision of 0.8499, whereas Test Model 1 has a precision of 0.8488, as shown in Table 28. Test Model 2 shows the lowest precision in both experiments. In experiment 1, the proposed model predicts five emotions (disgust, happy, neutral, sad, and surprise) more precisely than the other models. A similar trend is observed in experiment 2.
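To make the precision comparison concrete, the sketch below computes per-class and average precision from a confusion matrix using the standard definition, true positives divided by all predictions of that class. The matrix is a hypothetical example of a model that assigns every image to one class; it is not taken from the reported results, and treating a class with no predictions as having zero precision is an assumption of this sketch.

```python
def per_class_precision(confusion):
    """Precision per class: diagonal entry divided by its column sum.

    Rows are true labels, columns are predicted labels. A class that is
    never predicted gets precision 0.0 (an assumption of this sketch).
    """
    n = len(confusion)
    precisions = []
    for j in range(n):
        predicted = sum(confusion[i][j] for i in range(n))
        precisions.append(confusion[j][j] / predicted if predicted else 0.0)
    return precisions

def average_precision(confusion):
    """Unweighted mean of the per-class precisions."""
    values = per_class_precision(confusion)
    return sum(values) / len(values)

# Hypothetical model that labels every image as the third class.
degenerate = [
    [0, 0, 5],
    [0, 0, 5],
    [0, 0, 5],
]

precisions = per_class_precision(degenerate)
mean_precision = average_precision(degenerate)
print(precisions)
print(round(mean_precision, 4))
```

The third class is recognized every time it occurs, yet its precision is only one third and the average is lower still. This is one way a model can reach 100% accuracy on a single emotion while showing the lowest average precision.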

Conclusions and Future Work
A new architecture design for a convolutional neural network is presented in this study for facial expression recognition. By changing the arrangement of the layers and applying a learning rate of 1 × 10⁻⁴, a substantial improvement in the precision of the model has been accomplished. Extensive experiments are performed using the CK+ and JAFFE datasets. Two strategies are used for the experiments: the first uses the CK+ and JAFFE datasets as one dataset, while in the second, CK+ is used for training and validation and JAFFE is used for testing. Performance is evaluated at different epoch levels and with other hyperparameters. Experimental results suggest that the proposed model shows superior performance compared to both models used for performance comparison. The proposed model achieves average accuracy scores of 92.66% and 94.94% for experiments 1 and 2, respectively. To deal with occlusion and posture change, the images are generated at different angles, and results indicate that the proposed model is able to detect emotions at 45°. Despite the better results of the proposed model, a limitation of this study is that dark-colored faces and dark images are not used for emotion detection.
In the future, we intend to build an application using the proposed model that can detect emotions for patients with autism spectrum disorder, who face difficulty in expressing emotions and in social interaction. It will help them to communicate with others and can be of great help for diagnostic and therapeutic services. This application would scan the person and translate their intuitions and emotions for other people. It could be used extensively by medical practitioners, therapists, and psychologists who primarily work with people with mental illnesses, developmental disabilities, and neurological disorders, hence providing a great service to society and humanity.

Figure 1. Workflow of the proposed methodology.

Figure 2. Architecture of the proposed model.
In the second layer, the output size increases from 32 × 32 to 64 × 64 with the same filter size. The input size is 11 × 11. After that, the max-pool volume becomes [4 × 4 × 64], while applying the third convolutional layer results in a size of [4 × 4 × 128]. The max-pooling produces a size of [2 × 2 × 128]. From this point, the output size of the convolutional layers begins to drop, reversing the earlier increase. In the fourth layer, after max-pooling, the CNN model uses a 2 × 2 kernel with 64 feature maps and gives [1 × 1 × 64]. In the fifth convolutional layer, the volume becomes [1 × 1 × 32], producing 32 feature maps. The dense layer is applied with 1024 hidden neurons. A dense layer, or fully connected layer, changes the two- or multi-dimensional data into flat data. All these layers use the ReLU activation function, and the final layer is a SoftMax layer. Machine-learning-based models have two phases, training and testing/execution; in the training phase, all the data along with labels are provided to the classifier to learn from
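The layer volumes listed above, from the second max-pooling onward, can be traced with a small shape calculator. This is a sketch under stated assumptions: convolutions are taken to preserve height and width (stride 1 with 'same' padding) and pooling is taken to be non-overlapping 2 × 2, neither of which is spelled out in the text. The helper names are our own, and the first two layers are omitted because their sizes cannot be recovered unambiguously from the description.

```python
def conv_same(shape, filters):
    """Stride-1 convolution with 'same' padding: only the depth changes."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, size=2):
    """Non-overlapping max-pooling: height and width shrink by `size`."""
    height, width, depth = shape
    return (height // size, width // size, depth)

# Volume after the second max-pooling, as stated in the text.
volume = (4, 4, 64)
trace = [("pool 2", volume)]

# Remaining layers in the order described: conv 3, pool, conv 4, pool, conv 5.
for name, operation, argument in [
    ("conv 3", conv_same, 128),
    ("pool 3", max_pool, 2),
    ("conv 4", conv_same, 64),
    ("pool 4", max_pool, 2),
    ("conv 5", conv_same, 32),
]:
    volume = operation(volume, argument)
    trace.append((name, volume))

# Number of values handed to the 1024-neuron dense layer after flattening.
flat_features = volume[0] * volume[1] * volume[2]

for name, shape in trace:
    print(f"{name}: {shape[0]} x {shape[1]} x {shape[2]}")
print("flattened features:", flat_features)
```

Under these assumptions the trace reproduces the volumes quoted in the text: [4 × 4 × 128], [2 × 2 × 128], [1 × 1 × 64], and [1 × 1 × 32].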

4.1. Experiment 1
4.1.1. Preparation of Dataset
For experiment 1, the images are divided into training and testing data in a ratio of 0.8 to 0.2. As a result, the number of training samples is 5260, while the number of testing samples is 1315. These sets are used for preprocessing and later for classification. To prepare the dataset, we manually label the emotions of the CK+ and JAFFE databases. After labeling, faces are detected with the help of Haar cascades. Then the images are cropped to keep only the face area, which reduces the image area and saves computation. The images are resized to 32 × 32. The dataset after preprocessing consists of 92,410 images. The training dataset is further divided into two sets: the training set and the validation set. The training set is used to train the model, while the validation set is used during training to check the prediction accuracy and adjust the values of the hyperparameters accordingly. It gives an impartial evaluation of the model fit on the training dataset. The test dataset is used to provide a fair evaluation of the final model fit on the training dataset.
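The 0.8 to 0.2 division described above can be sketched as a shuffled index split. The total of 6575 images is taken as the sum of the 6362 CK+ and 213 JAFFE images reported elsewhere in the paper; the function name, the fixed seed, and the use of a plain random shuffle (rather than, for example, a per-emotion stratified split) are assumptions of this sketch, since the text does not say how the division was drawn.

```python
import random

def split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]

# 6362 CK+ images plus 213 JAFFE images, treated as one dataset.
total_images = 6362 + 213
train_idx, test_idx = split_indices(total_images)

print(len(train_idx), len(test_idx))  # 5260 1315
```

The resulting set sizes match the 5260 training and 1315 testing samples reported for experiment 1.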

Figure 5. Confusion matrix of the proposed model after 10 epochs.

Figure 7. Confusion matrix of the proposed model after 100 epochs.

Table 1. Summary of discussed works along with different combinations of face detection, feature-extraction techniques, and databases.

Table 2. Number of subjects and number of emotions in each database.

Table 3. Number of images according to the emotions in the CK+ and JAFFE databases, original and after removing duplicates.

Table 4. Hyperparameters of the proposed model.

Table 5. Hyperparameters of Test Model 1 and Test Model 2.

Table 6. First observation after 10 epochs of training.
Upon satisfaction with the performance of the proposed model, the models are trained for 100 epochs. Results are shown in Table 7. Over 100 epochs, the total iterations are 51,000 and the total steps are 79,700 during training. After 100 epochs, the proposed model reached an accuracy of 99.44% and a validation accuracy of 93.20%. The loss decreased to 0.01304 and the validation loss was 0.38968, at 38.104 s/epoch. Test Model 1's accuracy was 98.33% and its validation accuracy was 92.33%, at 33.51 s/epoch. For Test Model 2, accuracy was 16.29%, validation accuracy was 15.40%, the loss was 19.2747, and the validation loss was 19.4809, at 48.152 s/epoch.

Table 7. Second observation of the first experiment after 100 epochs of training.
To test the models, the previously created testing dataset is used for each model. Feeding the dataset into the models gives the statistics shown in Table 8.

Table 8. Test result with prediction detail of Model 1 after 10 epochs.

Table 9. Test result with prediction detail of Model 2 after 10 epochs.

Table 10. Test result with prediction detail of the proposed model after 10 epochs.

Table 11. Comparison of emotion prediction by Model 1, Model 2, and proposed model for experiment 1 after 10 epochs.

Table 12. Test result with prediction detail of Model 1 after 100 epochs.

Table 13. Test result with prediction detail of Model 2 after 100 epochs.

Table 14. Test result with prediction detail of the proposed model after 100 epochs.

Table 15. Comparison of emotion prediction by Model 1, Model 2, and proposed model for experiment 1 after 100 epochs.

Table 16. Division detail along with numbers of images per set for cross dataset test.
Table 17 shows the experimental results after 10 epochs. The accuracy of the proposed model is 89.76% and its loss is 0.29435, whereas Test Model 1 has 85.88% accuracy and a loss of 0.47114, and Test Model 2 reaches only 18.93% accuracy with a loss of 18.6617. The proposed model completes each epoch in 41.302 s, whereas Test Model 1 finishes one epoch in 35.641 s. Test Model 2 is the slowest of all at 54.904 s. The proposed model has the highest accuracy and the lowest loss of the three.

Table 17. First observation after 10 epochs of training of experiment 2.
Similar to the first experiment, the second training session is performed for 100 epochs to evaluate the models. The results of the training are shown in Table 18. Over 100 epochs, the total number of iterations is 44,534 and the total number of steps is 69,600 during training. After the training, the proposed model achieves 99.40% accuracy and 93.83% overall accuracy. The loss drops to 0.01603 and the validation loss reaches 0.35530, at 35.817 s/epoch.

Table 18. Second observation after 100 epochs of training of experiment 2.
Test Model 1 obtains 98.34% accuracy and 92.83% validation accuracy. Its loss is reduced to 0.03960 and its validation loss becomes 0.42293, at 30.463 s/epoch, whereas Test Model 2 only manages to reach an accuracy of 16.11% with a validation accuracy of 17.06%. The loss of this model is 19.31624 and its validation loss is 19.09687, at 47.763 s/epoch.

Table 19. Cross dataset testing result of Test Model 1 after 10 epochs.

Table 22. Comparison of Test Model 1, Test Model 2, and proposed model after 10 epochs with the cross dataset.
After the satisfactory results from test 1 with the cross dataset, a second test is performed based on 100 epochs. After training all the models for 100 epochs, the data is tested on each model, with the following outcomes. Results for Test Model 1 are given in Table 23. According to the final result of the test, the total accuracy of Test Model 1 is 93.16%. The most accurately detected emotion is 'disgust' with 97.29% accuracy, and the least accurately detected is neutral with 76.19% accuracy. Results for Test Model 2 are given in Table 24. Results indicate that Test Model 2 predicts only the 'sad' emotion with 100% accuracy, while the accuracy for all other emotions is 0. The average accuracy of Test Model 2 for all emotions is 14.55%.

Table 23. Cross dataset testing result of Model 1 after 100 epochs.

Table 24. Cross dataset testing result of Model 2 after 100 epochs.

Table 25. Cross dataset testing result of the proposed model after 100 epochs.

Table 26. Comparison of Test Model 1, Test Model 2, and proposed model after 100 epochs with cross dataset.

Table 30. Precision comparison between all three models with respect to observation 2 of Experiment 2 with 100 epochs.