Facial Emotion Recognition in Verbal Communication Based on Deep Learning

Facial emotion recognition from facial images is considered a challenging task due to the unpredictable nature of human facial expressions. The current literature on emotion classification has achieved high performance with deep learning (DL)-based models. However, these models suffer from performance degradation due to the poor selection of layers in the convolutional neural network (CNN) model. To address this issue, we propose an efficient DL technique using a CNN model to classify emotions from facial images. The proposed algorithm is an improved network architecture developed to process the aggregated expressions produced by the Viola–Jones (VJ) face detector. The internal architecture of the proposed model was finalised after a set of experiments to determine the optimal configuration. The results of this work were generated through subjective and objective performance evaluation. An analysis of the results presented herein establishes the reliability of each type of emotion, along with its intensity and classification. The proposed model is benchmarked against state-of-the-art techniques and evaluated on the FER-2013, CK+, and KDEF datasets. The utility of these findings lies in their application by law enforcement bodies in smart cities.


Introduction
Facial expressions play a very significant role in nonverbal communication, and they are among the most expressive of nonverbal cues. They arise naturally and reflect not only emotions but also mental activity, physical gestures, and social interaction [1]. Facial expression recognition (FER) is widely used in several applications, including customer satisfaction recognition, human-computer interaction, medical diagnostics, elderly care, criminal justice systems, security monitoring, smart card applications, and improved law enforcement services in smart cities [2][3][4]. Vision sensor-based FER has attracted attention in current research and has great potential for real-time recognition. In vision-based FER, researchers have focused on seven basic expressions, i.e., anger, disgust, fear, happy, neutral, sad, and surprise [5,6], and have categorised FER methods into two sub-categories: conventional and deep learning (DL)-based methods.
The main contributions of this work are as follows:
1. We propose an efficient framework for FER that can be deployed on resource-constrained hardware to identify and monitor suspicious activities, assisting law enforcement agencies in smart cities with a cost-effective solution for better security.
2. The proposed framework is based on a lightweight CNN model for FER and uses the VJ algorithm for face detection.
3. We performed cross-validation of the proposed model to fully assess its generalisation abilities.
4. The performance of the proposed model is evaluated on several benchmark datasets, and the results reveal significant improvements in accuracy compared to state-of-the-art approaches.
The rest of the paper is organised as follows: Section 2 describes related work, Section 3 explains the methodology, Section 4 presents the results, and Section 5 concludes the paper.

Related Work
FER is currently the subject of considerable active research, and several cutting-edge techniques have been proposed over the past two decades. However, due to the wide variation of facial features across ages, cultures, genders, scales, and viewpoints, the task demands techniques with better accuracy. The relevant literature includes numerous studies on the use of facial expressions to identify feelings and emotions. Several researchers have proposed techniques based on both conventional and DL-based methods. However, conventional vision-based FER methods have achieved limited performance in extracting features from the given input images and classifying them accordingly. For instance, Kumar et al. [8] proposed a three-tier framework for FER. In the first tier, Otsu's thresholding approach is used to remove the background using the YCbCr colour scheme; in the second tier, max-min algorithms are used to select the initial cluster values for the K-means algorithm, which segments the most important regions of the image (nose, mouth, forehead, eye gap, and eyes); and in the third tier, different shape features are extracted from these segmented regions and fed into a two-level rule-based classifier for FER. Shan et al. [9] employed LBP and discriminant LBP features on the given input images and fed them to an SVM classifier for effective FER. Mansour et al. applied a PCA-based method for efficient FER [10], while Kumar et al. [11] presented a real-time system for FER. The authors used the HoG feature descriptor to extract the most prominent facial features and an SVM classifier to differentiate the extracted features into seven different emotions. Sajjad et al. [4] developed an FER-based system for detecting suspicious activity. The authors used the VJ algorithm for face detection; the detected face was then preprocessed with median and Gabor filters.
The ORB features were then extracted, and an SVM classifier was trained to classify the seven basic emotions. Wang et al. [14] used geometric LBP and Gabor feature descriptors for salient facial feature extraction; the extracted features were then classified by an HMM. Abdulrahman et al. [15] employed a PCA and LBP feature extraction mechanism, using SVM as a classifier to differentiate the extracted features into the seven basic emotions. They used the VJ algorithm for face detection, a Supervised Descent Method to extract prominent features from the detected face, and a decision tree algorithm to classify the seven basic emotions. In conventional vision-based FER, the separation of feature extraction and classification is the major concern: these methods require domain experts for prominent feature selection and classification, and they are time-consuming and error-prone.
DL-based methods are applied to overcome the challenges of conventional vision-based systems for FER. Numerous research studies have examined FER, and some of the most recent work has focused on developing effective and efficient training models. For instance, Mayya et al. [37] presented an approach for FER using DCNN features. The authors employed a pretrained DCNN model, which used the pretrained weights of ImageNet [38], and obtained a 9216-dimensional vector for validation with an SVM to recognise the seven basic emotions. Sajjad et al. [30] proposed an FER method for behaviour analysis using videos from well-known TV shows. In this approach, the VJ algorithm is used for face detection, and then HoG, SVM, and a CNN model are used for feature extraction and classification. Al-Shabi et al. [7] used a hybrid model for FER, fusing SIFT and CNN features for facial feature analysis. Yu et al. [39] investigated the performance of CNNs for FER by employing an ensemble of CNNs with five convolutional layers. The authors used a stochastic pooling strategy instead of maximum pooling to achieve better performance. Jain et al. [40] proposed a new DL model containing deep residual blocks and convolutional layers for accurate FER. Other DL models, such as VGG, AlexNet, a four-layer CNN, ResNet, and MobileNet, are used in [5,27,33,34,41,42] for accurate FER. However, time complexity, model size, inference speed, and performance on unseen data restrict these systems from real-world implementation; therefore, an efficient and accurate model has yet to be developed. The framework proposed in this paper addresses these concerns of increased computational cost and of feature extraction from low-resolution images in poor-quality scenarios.

Materials and Methods
FER has become an area of interdisciplinary research. In addition to other applications, FER has a wide range of uses in the field of security, as it can be used to identify and verify a person's expressions in a photo or video. In this work, we treat FER as a two-step process: (1) face detection from a live video stream using the VJ algorithm and (2) emotion classification with a four-layer CNN architecture.

Datasets
To measure and evaluate several methods for classifying and recognising facial emotions, standardised datasets are needed. Several facial emotion datasets have been developed in recent decades. The following sections present a detailed overview of the standard benchmark datasets used in this work.

FER-2013
The FER-2013 dataset consists of 33,000 grayscale images of faces expressing the seven basic emotions: anger, disgust, fear, happy, neutral, sad, and surprise [43]. Faces are automatically registered so that each face is roughly centred and occupies approximately the same amount of space in each image.
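FER-2013 is commonly distributed as a CSV file in which each row stores an integer label and a 48 × 48 face as a space-separated pixel string. As a hedged sketch (the column layout and label order follow the widely used `fer2013.csv` release and may differ from the copy used in this work), one row can be decoded as follows:

```python
import numpy as np

# Label order of the common fer2013.csv release (an assumption, not taken from this paper)
EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def decode_fer_row(emotion_id, pixel_string, size=48):
    """Turn one CSV row (integer label + space-separated pixel string) into
    a (label name, size x size uint8 image) pair."""
    pixels = np.array([int(p) for p in pixel_string.split()], dtype=np.uint8)
    return EMOTIONS[emotion_id], pixels.reshape(size, size)
```

The resulting image array can then be resized to the model's input shape before training.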

CK+
CK+ consists of 593 images of 120 people aged 18-30 years [44]. The dataset covers all seven basic emotions at a resolution of 640 × 490 or 640 × 480, in 8-bit grayscale. Approximately 81% of the subjects are European-American, 13% are African-American, and 6% are of other ethnicities, with a women-to-men ratio of 65:35.

The KDEF
The KDEF [45] consists of 490 JPEG images of 35 women and 35 men, each depicting the seven different emotional expressions, at a resolution of 72 × 72 pixels.

Facial Detection Using the VJ Algorithm
The VJ algorithm comprises three stages: Haar feature selection, AdaBoost learning, and cascade classifier construction. Haar features distinguish darker regions, such as the eyes, from brighter regions, such as the bridge of the nose. This is done by comparing the pixel values of both regions: the summed pixel values of the dark region are subtracted from those of the bright region, and the difference is compared against a specific threshold to check for the appearance of facial parts such as the nose, eyes, and chin, classifying the window as face or non-face. In the detection process, each detector consists of a succession of strong classifiers, each trained using AdaBoost from combinations of weak classifiers obtained from Haar features of edge, line, or four-rectangle structure. By cascading these classifiers, the algorithm interprets identifiably different parts of a face according to whether the identifiable portion is an edge, line, or four-sided structure. In the proposed technique, we integrated the VJ face detection algorithm as follows: the camcorder captures a video, and the video frames are extracted as input images, cropped, and converted into grayscale. Once an image is converted to grayscale, it goes through the feature extraction process shown in Figure 1, which shows different sample images (a man, a woman, and a child) to demonstrate that the framework detects the emotions of both genders; the proposed system is also independent of the age factor.
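The dark-versus-bright comparison described above is computed efficiently with an integral image, which returns any rectangle sum in four table lookups. The sketch below illustrates a single two-rectangle (edge) Haar feature in isolation; it is a simplified illustration of the principle, not the trained cascade itself:

```python
import numpy as np

def integral_image(img):
    """Padded integral image: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, h, w):
    """Sum over img[top:top+h, left:left+w] in four table lookups."""
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

def haar_edge_feature(ii, top, left, h, w):
    """Two-rectangle (edge) feature: upper half minus lower half.
    A large positive value suggests a bright-over-dark transition,
    e.g. forehead over eyes."""
    half = h // 2
    return rect_sum(ii, top, left, half, w) - rect_sum(ii, top + half, left, half, w)
```

A trained detector evaluates thousands of such features at many window positions and scales, with AdaBoost selecting and weighting the most discriminative ones.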

Proposed Model Architecture
In the proposed model, the faces detected by the VJ algorithm are fed into the proposed CNN architecture for prominent feature extraction and classification. The features of the selected frames pass through a series of four convolutional layers, each followed by a pooling layer and the ReLU activation function. Afterward, the processed frame is passed to the fully connected layers to classify the input image into one of the seven emotions. The proposed architecture is presented in Figure 1 and was chosen for its speed and accuracy and because it follows a widely reported design. In this architecture, 32 kernels (size 3 × 3) are applied in the first layer with batch normalisation and the ReLU activation function, using a 224 × 224 × 3 input shape for RGB data and a 224 × 224 × 1 input shape for grayscale. We used max pooling with a 2 × 2 kernel size to reduce the dimensions. This process is repeated for the remaining convolutional and pooling layers, increasing the number of kernels from 32 to 64 in the second layer, from 64 to 128 in the third layer, and from 128 to 256 in the fourth layer. The first and second fully connected layers contain 64 and 128 neurons, respectively, and the softmax layer consists of seven neurons that provide the probability of each class. Table 1 shows the internal architecture of the proposed model.

Figure 2 shows the workflow of the proposed model. The input frame is extracted from the video and each face is detected in the input image through the VJ algorithm. If no face is detected, the system scans another video frame, and the process continues until a face is detected. When a face is detected, it is cropped to a 224 × 224 pixel size and passed to the CNN model for emotion recognition into the seven classes (anger, disgust, fear, happy, neutral, sad, and surprise) and display of the output results.
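The spatial sizes implied by this stack can be checked with a short trace. The sketch below assumes 'same' padding on the 3 × 3 convolutions, so that only the 2 × 2 max pooling halves the feature maps; the padding choice is our assumption, as it is not stated explicitly in the text:

```python
def trace_shapes(side=224, filters=(32, 64, 128, 256)):
    """Follow the (height, width, channels) shape through four conv + pool blocks."""
    shapes = []
    for f in filters:
        # A 3x3 convolution with 'same' padding keeps the spatial size;
        # the channel count becomes the number of kernels f.
        side //= 2  # 2x2 max pooling halves height and width
        shapes.append((side, side, f))
    return shapes
```

Starting from 224 × 224, this trace gives 112 × 112 × 32, 56 × 56 × 64, 28 × 28 × 128, and 14 × 14 × 256, so the flattened vector entering the fully connected layers has 14 × 14 × 256 = 50,176 elements under these assumptions.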

Results and Discussion
Experimental results were obtained from three benchmark datasets: FER-2013, CK+, and KDEF. Each dataset was divided into training, testing, and validation data: 60% of the data were selected for training, 20% for testing, and 20% for validation. We followed a state-of-the-art method to split the datasets [46]. Before choosing these percentages, we also tested the proposed model with several other data splits, which showed that the proposed model can learn effectively even with less training data. We conducted experiments during the training and testing processes to observe how the performance of the proposed system changed. To obtain streams from the VSN [22] for experimental evaluation, we used Python 3.6.4, OpenCV 3+, Keras, and TensorFlow with resource-constrained devices. The model was trained on a system with a GTX 1070 GPU (8 GB of onboard memory), a 2.8 GHz Intel Core i5 processor, 16 GB of RAM, and a 1 terabyte (TB) hard drive. Table 2 shows the detailed specifications of the system and the important libraries.
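The 60/20/20 partition can be reproduced with a simple shuffled index split. This is a generic sketch; the exact split procedure of [46] may differ in detail (e.g., stratification by class):

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=42):
    """Partition dataset indices into train/test/validation subsets.
    The validation set receives whatever remains after train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for a reproducible split
    n_train = int(n * train)
    n_test = int(n * test)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]
```

Fixing the seed makes the split reproducible across runs, which matters when comparing model variants on identical subsets.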


Experimental Evaluation
In this section, we evaluate the performance of the proposed model on the benchmark datasets. The detailed experimental results for each dataset are described in the following subsections. Figure 3 shows the accuracy of our proposed CNN model on FER-2013 during training and validation; the x-axis shows the number of epochs, while the y-axis shows the accuracy. We set 30 epochs as the standard for model training. The validation accuracy of the proposed method on FER-2013 started at around 20%, whereas the training accuracy started at around 50%. With each epoch, the training and validation accuracy increases, and after several epochs, both accuracies stabilise. Over 30 epochs, the training and validation accuracy both reached 89%, indicating that the proposed model fits the data variation well. The testing accuracy of each class was 77.78% for anger, 81.50% for disgust, 85.86% for fear, 93.33% for happy, 95% for neutral, 93% for sad, and 90.44% for surprise. The overall testing accuracy of our model on FER-2013 was 89%, and the confusion matrix of all classes is given in Figure 4.
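The per-class and overall figures above are the standard confusion-matrix statistics: each class accuracy is the diagonal entry divided by its row sum, and the overall accuracy is the trace divided by the total count. A small illustration with a made-up 3-class matrix (not the paper's actual counts):

```python
import numpy as np

def accuracies(cm):
    """Per-class recall (diagonal / row sum) and overall accuracy (trace / total)
    from a confusion matrix whose rows are true classes and columns predictions."""
    cm = np.asarray(cm, dtype=float)
    per_class = np.diag(cm) / cm.sum(axis=1)
    overall = np.trace(cm) / cm.sum()
    return per_class, overall
```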


Cross-Validation of the FER-2013 Trained Model over the KDEF and CK+ Datasets
We also evaluated the performance of the proposed model using cross-corpus evaluation, where the model is trained on the FER-2013 dataset and validated on the KDEF and CK+ datasets. Tables 3 and 4 present the detailed results. Cross-corpus validation was performed on a held-out test set to check the generalisation ability of the proposed model on unseen data.


Performance Evaluation over CK+
To evaluate the performance of the proposed model on the CK+ dataset, we experimented with the same number of epochs as previously used for FER-2013. As shown in Figure 5, the training and validation accuracy of our model on the CK+ dataset rose over the course of training, reaching 92% and 89%, respectively. The testing accuracy is shown in the confusion matrix in Figure 6, which indicates that the overall testing accuracy of the proposed system on CK+ is 90.98%. The proposed model achieved 77.57%, 85%, 88%, 98.31%, 99%, 99%, and 90% accuracy for the anger, disgust, fear, happy, neutral, sad, and surprise classes, respectively.

Cross-Validation of CK+ Trained Model over FER-2013 and KDEF Datasets
The performance of the model trained on the CK+ dataset was cross-validated on the FER-2013 and KDEF datasets. We selected test samples from each class of the mentioned datasets and performed cross-validation to verify the robustness of the model. Tables 5 and 6 present the results.

Performance Evaluation over KDEF
Figure 7 illustrates the results on the KDEF dataset in terms of training and validation accuracy, which started at around 50%. With each epoch, the accuracy gradually increased as the model parameters were learned. Over 30 epochs, the training and validation accuracy reached 94% and 93%, respectively. The final KDEF confusion matrix indicates that the overall testing accuracy of the proposed model was 94.04%: anger is identified at an accuracy of 91.89%, disgust at 91.50%, fear at 92%, happy at 97%, neutral at 95.89%, sad at 94.28%, and surprise at 95.78%, as shown in Figure 8.
The model trained on the KDEF dataset was cross-validated on the FER-2013 and CK+ datasets; the detailed results are given in Tables 7 and 8. Cross-corpus validation was performed on a held-out test set to check the generalisation ability of the proposed model on unseen data.

Summary of Results
To conclude our analysis of the above results, the proposed model was trained on the FER-2013, CK+, and KDEF datasets individually, and its performance was validated on each. Furthermore, each trained model was cross-validated on the other two datasets to fully assess its generalisation ability. The proposed model achieved remarkable accuracy on each dataset; however, the model trained on CK+ achieved lower accuracy when validated on FER-2013, and the model trained on FER-2013 achieved lower accuracy when validated on KDEF. The former drop is attributable to the unbalanced class samples in FER-2013, given that the model was trained on the balanced CK+ dataset. The latter drop arises because KDEF is an RGB dataset, while the FER-2013 samples are in grayscale.

Comparative Analysis of the Proposed Model with State-of-the-Art Techniques
We conducted several experiments to compare the performance of the proposed FER model with other state-of-the-art methods, as shown in Table 9. To evaluate the model's robustness, we first compared its performance on FER-2013; the proposed model surpassed the models of Arriaga et al. [51], J. Li et al. [52], Subramanian et al. [53], and Borgalli et al. [46] by 23.2%, 17.2%, 0.6%, and 2.42% in accuracy, respectively. We also assessed the robustness of our model on the CK+ dataset, where it achieved promising results compared to state-of-the-art methods: 1.48%, 9.98%, and 3.29% higher accuracy than the models of Hasani et al. [54], Borgalli et al. [46], and Bodapati et al. [6], respectively. We then assessed the proposed model on the KDEF dataset, where it obtained 0.65% higher accuracy than the method proposed by Sajjad et al. [30]. In further analysis, the methods of Haq et al. [55] and Liu et al. [56] performed 0.39% and 7.14% worse, respectively, than the proposed model. Table 9. Comparing the performance of the proposed model with state-of-the-art methods over three benchmark datasets.

Time Complexity of the Proposed Model over GPU, CPU, and Resource-Constrained Devices
We evaluated the proposed model in real time to compute its processing time on a GPU, a CPU, and a resource-constrained device (the Jetson Nano). The Jetson Nano is a small but powerful computer that can run multiple CNNs in parallel for applications such as recognition, segmentation, object detection, and speech processing. Its GPU has 128 NVIDIA CUDA® cores, its CPU is a quad-core ARM Cortex-A57, and it has 4 GB of memory. The proposed model achieved 45, 21, and 26 frames per second (fps) on the GPU, CPU, and Jetson Nano, respectively. The time complexity of the proposed model is thus low enough for deployment in real-world scenarios.
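Throughput figures such as these can be obtained with a simple wall-clock loop around the per-frame inference call. In the sketch below, `infer` is a placeholder for the actual model call, and the warm-up iterations are our assumption to exclude one-off initialisation cost from the measurement:

```python
import time

def measure_fps(infer, frames, warmup=2):
    """Average frames per second of `infer` over a list of frames."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up iterations excluded from timing
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

On hardware with asynchronous execution (e.g., a GPU), the timed call should include any synchronisation needed to ensure the frame has actually been processed.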

Conclusions
The capabilities built into FER technology on resource-constrained devices, such as the Jetson Nano, can greatly assist law enforcement agencies in effectively identifying suspects by analysing a person's facial expressions. This requires an effective framework to facilitate the identification of suspicious individuals from their facial expressions. With this in mind, we have proposed an efficient facial expression framework using the Jetson Nano, a resource-constrained device, that measures facial expressions from video streams captured by the VSN. The proposed framework automatically extracts the face using the VJ algorithm and then identifies facial expressions using the proposed model, achieving significantly better results than the other methods. Quantitative and qualitative evaluations on three different datasets demonstrated the effectiveness of the proposed framework for enhancing law enforcement services in smart cities. In future studies, we will extend the proposed framework to incorporate gender classification and age prediction for more detailed identification of facial emotions; such a system would determine the gender, age, and emotions of individuals effectively. We will apply various DL models and review their performance on resource-constrained devices. We will also apply data augmentation techniques to balance the samples in each class and increase the number of samples for all classes to further improve the performance of the proposed model.