Real-Time Facial Emotion Recognition Framework for Employees of Organizations Using Raspberry-Pi

There is significant interest in facial emotion recognition in the fields of human–computer interaction and the social sciences. With the advancements in artificial intelligence (AI), the field of human behavioral prediction and analysis, especially of human emotion, has evolved significantly. The most common emotion recognition methods currently deploy models on remote servers. We believe that reducing the distance between the input device and the model server can lead to better efficiency and effectiveness in real-life applications; computational methodologies such as edge computing can be beneficial for this purpose and can also enable time-critical applications in sensitive fields. In this study, we propose a Raspberry-Pi-based standalone edge device that can detect facial emotions in real time. Although this edge device can be used in a variety of applications where human facial emotions play an important role, this article focuses on employees working in organizations. The device is implemented using the Mini-Xception deep network because of its computational efficiency and short inference time compared with other networks. It achieved 100% accuracy for detecting faces in real time and 68% accuracy for emotion recognition, i.e., higher than the accuracy reported in the state of the art on the FER 2013 dataset. Future work will implement the deep network on a Raspberry-Pi with an Intel Movidius Neural Compute Stick to reduce processing time and achieve quicker real-time facial emotion recognition.


Introduction
In today's context, video cameras are easily accessible to everyone, whether mobile cameras or static ones such as surveillance cameras, smartphone cameras, Raspberry-Pi cameras, or laptop webcams. With these cameras, it is easy to capture human faces anywhere, at any time. This freedom has enabled the research community to build smart systems for understanding human behavior in real time. To understand the behavior of a human being, expression plays the most important role. Various surveys have been conducted to understand the components that play major roles in understanding human emotions; their outcome was that non-verbal components, namely facial expressions, play the most important role.

The main contributions of this study are as follows:
• The implementation of a hardware prototype for real-time facial emotion detection with Raspberry-Pi.
• A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of emotive facial images.
• Support vector machine (SVM) classification is implemented on the Raspberry-Pi hardware for classifying persons.
The organization of the study is as follows: Section 2 presents an overview of AI and CNNs; Section 3 presents the proposed methodology for real-time facial emotion detection; Section 4 presents the hardware setup, the experimental results, and a comparison of the proposed hardware with previous studies. Finally, the study is concluded in the final section.

Background
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and recreate human actions [19]. The concept also extends to any system that demonstrates characteristics of a human mind, such as learning and problem-solving. AI is interdisciplinary, but advancements in machine learning and deep learning are triggering a shift in perspective across nearly all sectors [20]. Computer vision is the area of AI that concentrates on image problems. Combined, convolutional neural networks (CNNs) and computer vision can perform complex operations ranging from image recognition to resolving scientific problems [21]. CNNs are well known for their image recognition and classification capability. In general, a basic convolutional neural network consists of neurons connected via multiple layers, which receive input images and process them stage by stage. A simple CNN consists of three types of layers: the convolutional layer, the max-pooling layer, and the fully connected layer. The first two are responsible for feature extraction, introducing non-linearity, and reducing features to limit overfitting. The last, the fully connected layer, performs the classification based on the features extracted in the previous layers and contains the majority of the parameters. Architectures such as Inception V3 [22] reduce the parameter count further by ending with a Global Average Pooling operation, which reduces each feature map to a scalar by taking its average. To reduce parameters still further, modern CNN architectures use residual modules [23] and depthwise-separable convolutions [5].
Depthwise-separable CNNs work by separating feature extraction from channel combination within the convolutional layer, so the parameters are further reduced. We therefore used the Mini-Xception CNN proposed by [24], which reduces the parameters by using depthwise-separable convolution layers instead of plain convolution layers and eliminates the fully connected layers.
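To make the parameter savings concrete, the following back-of-the-envelope sketch (with illustrative kernel and channel sizes, not taken from the paper) compares the weight counts of a standard convolution and its depthwise-separable counterpart:

```python
# Parameter-count comparison between a standard convolution and a
# depthwise-separable convolution, illustrating why Mini-Xception is small.
# The layer sizes below are illustrative, not from the Mini-Xception paper.

def standard_conv_params(k, c_in, c_out):
    # Each of the c_out filters spans k x k x c_in weights (biases omitted).
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel,
    # followed by a 1x1 pointwise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 73,728 weights
sep = depthwise_separable_params(3, 64, 128)  # 8,768 weights
print(std, sep, round(std / sep, 1))
```

For a 3 × 3 kernel with 64 input and 128 output channels, the separable variant needs roughly 8× fewer weights, which is exactly the kind of reduction that makes the network feasible on a Raspberry-Pi.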

Proposed Methodology
This section describes the proposed framework; each step is elaborated in the next sections. The entire process is divided into three tightly coupled tasks. The first task is to train the deep network after dividing the dataset into training, validation, and testing sets. The entire dataset of emotive facial images is divided in an 8:1:1 ratio: a dataset with $N$ images is split into $\mathcal{D}_{TR}$ for training, $\mathcal{D}_{VD}$ for cross-validation, and $\mathcal{D}_{T}$ for testing. That is, $\mathcal{D}_{TR} = \{I^{TR}_1, I^{TR}_2, I^{TR}_3, \ldots, I^{TR}_{4N/5}\}$ are the training images, $\mathcal{D}_{VD} = \{I^{VD}_{(4N/5)+1}, \ldots, I^{VD}_{9N/10}\}$ the validation images, and $\mathcal{D}_{T} = \{I^{T}_{(9N/10)+1}, \ldots, I^{T}_{N}\}$ the testing images. A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of the emotive facial images. Training, validation, and testing were done on Google Colab with a 12 GB NVIDIA Tesla K80 GPU using the FER 2013 dataset.
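As a minimal illustration (with a hypothetical list standing in for the image set), the 8:1:1 split described above can be sketched as:

```python
# Sketch of the 8:1:1 split described above: the first 4N/5 images train,
# the next N/10 validate, and the final N/10 test.

def split_8_1_1(images):
    n = len(images)
    train = images[: 4 * n // 5]
    val = images[4 * n // 5 : 9 * n // 10]
    test = images[9 * n // 10 :]
    return train, val, test

images = list(range(100))          # stand-in for N = 100 facial images
tr, vd, te = split_8_1_1(images)
print(len(tr), len(vd), len(te))   # 80 10 10
```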
The entire architecture is divided into two tightly coupled tasks, i.e., face recognition and facial emotion recognition in real time. For face recognition, a pre-trained deep network known as OpenFace is used. To begin with, a real-time image is captured from video as $I^r_1$; the total set of images captured in real time is $\delta_R = \{I^r_1, I^r_2, \ldots, I^r_N\}$. To train the deep network, 6 images of each subject were used, and the network was trained with the single-triplet training method for 20 different people with $N$ images, denoted $\delta_{TR}$. Once training is complete, the first task for a real-time captured image $I^r_1$ is to find the face inside it and discard unwanted information. The parameters used in the proposed architecture are described in Table 1. A well-known method, HOG (Histogram of Oriented Gradients), is used to find the faces. After detection, the facial image $I^r_1$ is cropped, and the cropped image $I^r_{cr}$ is further preprocessed to remove the effects of bad lighting, a tilted face, skewness, etc. The cropped image is preprocessed with the face landmark estimation algorithm, which locates 68 landmarks on $I^r_{cr}$; using simple affine transformations (rotation, shear, and scale), the preprocessed image $I^r_{pr}$ is produced so that the eyes and mouth of $I^r_{cr}$ are centered as well as possible. The preprocessed image $I^r_{pr}$ is fed to the pre-trained network, which extracts features from it and generates a feature vector of 128 embeddings, $\Phi^{embd}_n$, as measurements of the face. The last step is to classify the image by finding the closest match of the 128 embeddings $\Phi^{embd}_n$ against the database images.
The feature vector $\Phi^{embd}_n$ is passed through a simple SVM classifier $\check{S}$ to recognize the face. The entire architecture, with its detailed framework, is represented in Figure 1. The second task is to transmit the output of the preprocessing stage, $I^r_{pr}$, to the cloud, where a pre-trained emotive facial recognition network is already available. This image is passed through all the layers of the Mini-Xception deep network, a fast, depthwise-separable convolutional neural network, to recognize the emotion captured in the image. The deep network on the cloud is trained on seven basic emotions, labelled $\{\epsilon_c\} = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_7\}$, where $c$ indexes the seven basic emotion classes.


Face Detection
After capturing facial images in real time using the Pi-Cam, the first and foremost task is to separate the faces. A variety of methods is available to remove unwanted and redundant information, such as the background, from the facial images. The best-known method was the Viola–Jones algorithm, introduced in the early 2000s. We use another method, Histogram of Oriented Gradients (HOG), to detect the facial images. The Raspberry-Pi captures the emotive facial images in real time via the Pi-Cam from video frames. The captured input image $I^r$ is converted to grayscale, HOG features are extracted to locate the facial part of $I^r$, and finally the facial region is cropped to $I^r_{cr}$ and fed to a feature extraction unit to recognize faces in real time. Figure 2a shows the basic steps of face detection using HOG. As shown in Figure 2, gradients are calculated for the entire grayscale image, 16 × 16 pixels at a time.
This calculation is repeated over the entire grayscale image, producing an image of gradients. The next step is to find the strongest gradient in each 16 × 16 window of pixels and replace the gradients in that window with that strongest gradient. The result is a simplified image that captures the basic structure of a face. To locate the face in the captured input image or in real-time video, we locate the part of the image that most resembles a known HOG face pattern and crop that part of the input image $I^r$; the result is the cropped facial image $I^r_{cr}$.
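The per-pixel gradient step underlying HOG can be sketched as follows. This toy example, on a synthetic 4 × 4 patch, only illustrates how gradient magnitude and orientation are computed; a real deployment would use a library HOG face detector rather than this code:

```python
# Minimal sketch of the gradient step behind HOG, on a toy grayscale patch.
import math

def gradients(img):
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # horizontal intensity change
            gy = img[y + 1][x] - img[y - 1][x]   # vertical intensity change
            mag = math.hypot(gx, gy)             # gradient strength
            ang = math.degrees(math.atan2(gy, gx)) % 180  # unsigned orientation
            row.append((mag, ang))
        out.append(row)
    return out

patch = [[0, 0, 0, 0],
         [0, 10, 10, 0],
         [0, 10, 10, 0],
         [0, 0, 0, 0]]
g = gradients(patch)
print(g[0][0])   # a diagonal edge: equal gx and gy give a 45-degree orientation
```

A full HOG descriptor then pools these orientations into histograms over cells, which is the "image of gradients" described above.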

Face Alignment
As the face is captured in real time, the captured image can have the face turned in different directions. To deal with such situations, we warp each picture so that the system can locate the eyes and lips in the same place. For this operation, we used the algorithm proposed by [23], known as face landmark estimation. The main job of this algorithm is to locate 68 specific points, known as facial landmarks, on every face, as shown in Figure 2b.
These landmarks locate the eyes, nose, chin, lips, eyebrows, etc., on any face. As explained in [19], $S = (x_1^T, x_2^T, \ldots, x_p^T)^T$ is the vector representing the $p$ facial landmarks $x_i$ in image $I$. The main aim is to estimate $S$ as close as possible to the true shape, with the estimate at step $t$ denoted $S^{(t)}$. This is done with a cascade of regressors, where each regressor predicts an update that refines the vector so that the estimation becomes more accurate:

$$S^{(t+1)} = S^{(t)} + r_t\!\left(I, S^{(t)}\right)$$

where the regressor $r_t(\cdot,\cdot)$ is applied in a cascade to predict and update the vector $S^{(t+1)}$.
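A toy numerical sketch of the cascade update is shown below. The "regressors" here simply move the estimate a fixed fraction toward a hypothetical true shape; real regressors are learned from image features, so this only illustrates the iterative refinement scheme:

```python
# Toy illustration of the cascade-of-regressors update S(t+1) = S(t) + r_t(I, S(t)).
# Each stand-in "regressor" moves the estimate a fixed fraction toward the
# true shape; real regressors predict this step from image features.

def run_cascade(s_init, s_true, steps=10, rate=0.5):
    s = list(s_init)
    for _ in range(steps):
        # r_t predicts an additive update from the current estimate
        s = [si + rate * (ti - si) for si, ti in zip(s, s_true)]
    return s

true_shape = [30.0, 40.0, 55.0]     # stand-in for 68 landmark coordinates
estimate = run_cascade([0.0, 0.0, 0.0], true_shape)
print([round(v, 2) for v in estimate])   # close to true_shape after 10 steps
```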

Face Encoding
The next important step is to extract features from the exactly centered image. The best way to get unique features of any facial image is to measure the face, and the dimensions of each face are different. The main challenge is deciding which measurements play a vital role in recognizing the captured image; this is difficult to achieve with traditional feature extraction methods. To achieve accuracy and speed, a deep network is trained, as machines have proven better than humans at this kind of prediction. Training a deep network requires a lot of computation and power, so we used a pre-trained network provided by OpenFace [20]. We simply give it an input image, and the deep network produces 128 measurements for that face. Instead of single faces, the network has been trained on 3 facial images at a time, as shown in Figure 3: an anchor image of a person is trained together with a second (positive) image of the same person and a completely different (negative) image of another person, as shown in Figure 4. The main purpose is to have the anchor image closer to the positive image than to any negative image. The selection of triplets for producing the 128 measurements is important. Machine learning practitioners call these per-face measurements an "embedding". Training face embeddings on a large dataset of images improves accuracy and eventually decreases the error rate, though the process requires huge CPU power and a lot of time. To understand the triplet loss, consider an embedding $f(y) \in \mathbb{R}^s$ that maps an image $y$ into an $s$-dimensional Euclidean space. We constrain this embedding to live on the $s$-dimensional hypersphere, i.e., $\|f(y)\|_2 = 1$.
As shown in Figure 3a, the main aim is for the distance between the anchor $y^m_i$ of a specific person and every other image $y^p_i$ (positive) of the same person to be smaller than the distance to any image $y^n_i$ (negative) of another person.

So, we want to have the following:

$$\|f(y^m_i) - f(y^p_i)\|_2^2 + \beta < \|f(y^m_i) - f(y^n_i)\|_2^2, \quad \forall \left(y^m_i, y^p_i, y^n_i\right) \in \zeta \tag{1}$$

and the corresponding loss to be minimized is

$$L = \sum_{i=1}^{P} \left[\, \|f(y^m_i) - f(y^p_i)\|_2^2 - \|f(y^m_i) - f(y^n_i)\|_2^2 + \beta \,\right]_{+} \tag{2}$$

where $\beta$ is the enforced margin between negative and positive pairs of images and $\zeta$ is the set of all possible triplets, of cardinality $P$. Generation of multiple triplets will help to overcome the issue faced in Equation (1), and selection of suitable, hard triplets will result in the improvement of the deep learning model.
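The triplet loss above can be sketched numerically as follows; the two-dimensional vectors are illustrative stand-ins for the 128-dimensional embeddings, and the margin value is arbitrary:

```python
# Sketch of the triplet loss of Equations (1)-(2): embeddings are
# unit-normalized (the hypersphere constraint), and the loss penalizes
# anchors that are not at least beta closer to their positive than to
# the negative.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, beta=0.2):
    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    return max(sqdist(a, p) - sqdist(a, n) + beta, 0.0)

# A well-separated triplet incurs zero loss; a confusing one does not.
easy = triplet_loss([1, 0], [0.9, 0.1], [0, 1])
hard = triplet_loss([1, 0], [0, 1], [0.9, 0.1])
print(easy, hard)
```

Only the "hard" triplet contributes gradient, which is why triplet selection matters for training.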

SVM Based Classification
The last step, and the most important one, is finding the names of persons from the encodings. Different techniques have been presented to evaluate various classifiers. A variety of machine learning classification algorithms can be used to classify the faces, but we used the simplest and most efficient one, the support vector machine (SVM). We kept it simple because we only want the output to be the face with the person's name. Moreover, since we are implementing this on a Raspberry-Pi, we want the system to be fast and accurate. Running this classifier on the hardware takes milliseconds, which is what we want, and its result is the name of the person.
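A minimal sketch of this classification step is shown below, assuming scikit-learn is available. The embeddings are synthetic stand-ins for the 128-dimensional OpenFace measurements, and the names are hypothetical:

```python
# Sketch of the final step: a linear SVM maps each 128-d face embedding
# to a person's name. The embeddings here are synthetic stand-ins
# (real ones come from the OpenFace network).
import random
from sklearn.svm import SVC

random.seed(0)

def fake_embedding(center):
    # 128 values jittered around a per-person center, mimicking an embedding
    return [center + random.uniform(-0.05, 0.05) for _ in range(128)]

names, X, y = ["alice", "bob"], [], []
for label, center in enumerate([0.2, 0.8]):
    for _ in range(6):                 # 6 images per subject, as in the paper
        X.append(fake_embedding(center))
        y.append(names[label])

clf = SVC(kernel="linear").fit(X, y)
probe = fake_embedding(0.8)            # an unseen embedding of "bob"
print(clf.predict([probe])[0])
```

Because prediction is a single dot product per support vector, inference like this runs in milliseconds even on a Raspberry-Pi.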

Appl. Sci. 2021, 11, 10540


Dataset
The training of the network is illustrated in Figure 4: unique images from a single network are mapped into triplets, and the gradient of the triplet loss is back-propagated to the unique images through the mapping. The dataset we used consists of 35,887 images of facial emotions in seven categories (0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral). The FER 2013 dataset consists of 48 × 48 pixel grayscale images (https://www.kaggle.com/msambare/fer2013 (accessed on 11 May 2021)). The dataset is in CSV format, consisting of only two columns, i.e., "emotion" and "pixels", and is kept in Google Drive. The entire dataset is divided in an 8:1:1 ratio into training $\mathcal{D}_{TR}$, validation $\mathcal{D}_{VD}$, and testing $\mathcal{D}_{T}$ sets.
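Decoding one row of the CSV into a 48 × 48 image can be sketched as follows (the row shown is synthetic, not taken from the dataset):

```python
# Sketch of decoding one FER 2013 row: the "pixels" column stores 48*48
# space-separated grayscale values; "emotion" is the class index 0-6.
# The row below is a synthetic stand-in, not from the actual dataset.

def decode_pixels(pixels_str, size=48):
    values = [int(v) for v in pixels_str.split()]
    assert len(values) == size * size
    # reshape the flat list into a size x size grayscale image
    return [values[r * size:(r + 1) * size] for r in range(size)]

emotions = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
row = {"emotion": "3", "pixels": " ".join(["128"] * (48 * 48))}
img = decode_pixels(row["pixels"])
print(emotions[int(row["emotion"])], len(img), len(img[0]))   # Happy 48 48
```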
The images in the dataset are categorized by expression; the total number of images under each category is shown in Table 2, and a graphical representation of the dataset is shown in Figure 5. The FER 2013 dataset is not uniform: it does not contain an equal number of images in each category. Figure 6 shows sample images from the FER 2013 dataset. Many other datasets are available for detecting facial emotions.

Training CNN Model: Mini Xception
The dataset is kept on Google Drive, and training was done on Google Colab with a 12 GB NVIDIA Tesla K80 GPU. The CNN was trained with the 80% training portion of the FER dataset, with 10% of the dataset kept for validation. The architecture of Mini-Xception proposed by [24] is shown in Figure 7. Testing was done on the remaining 10% of the data, as shown in Figure 8, and on input from the Raspberry-Pi after detecting the face and converting the cropped, pre-processed images to 48 × 48 pixels. This architecture was trained on the FER 2013 dataset because we want a quick response, and Mini-Xception has proven quick and light thanks to its unique architecture and its replacement of convolutional layers with depthwise convolutional layers, which reduces the number of parameters and makes it practical for real-time emotion recognition. The training results are shown in Figure 9 and the training loss in Figure 10. As our system is based on a Raspberry-Pi, which has certain constraints in terms of memory and processing capability, a smaller number of parameters will also help future development of this system. We achieved an accuracy of 66% on the FER 2013 dataset (without augmentation), as reported in the state of the art, and 68% after data augmentation. The main reasons for this accuracy level are the variation in the dataset and the non-uniform number of images in each category.

Experimental Results
The detailed results are shown in this section. We divided the entire architecture to carry out two main tasks. The first task, face recognition, achieved an accuracy of 100%, as all the faces that were trained are recognized correctly; OpenFace has provided near-human accuracy [20] on the LFW benchmark, so we used it and implemented it on real-time video with the help of the Raspberry-Pi 3 B+ model. We used Python with OpenCV and, with the help of the Pi-Cam, acquired live video and recognized the faces in it. The proposed framework can locate multiple faces in a frame but recognizes only those that are already trained and present in the dataset. Face recognition gave correct results even with accessories such as spectacles. We installed the setup in the biometric attendance area so that, at punch-in and punch-out, the expressions on the faces of employees are recorded; after collecting a fortnight of data, the recorded faces with names and recognized expressions can be analysed. This analysis can be useful for recognizing consistent behavior of employees in private organizations. For example, an employee with a constantly sad or disgusted expression can be identified and referred to a happiness or psychological support cell to help such employees feel good and comfortable in the workplace. The hardware setup of the proposed system is shown in Figure 8. The comparison with various models is shown in Table 3, the specifications of the system are given in Table 4, and the detailed procedure is explained in Algorithms 1 and 2. After detecting the face in real time, the cropped and pre-processed image is given to the pre-trained deep network, trained on the FER 2013 dataset using Python and Keras. The classification results are shown in the confusion matrix in Table 5. Several misclassifications were found, e.g., "disgust" misclassified as "angry".
From the dataset, one can easily see that the count of disgusted faces is the lowest, at 547. This simply indicates that the network has seen fewer examples from which to learn features for the disgusted class compared with the other classes; hence the misclassification.
The total number of parameters for which the network has been trained is 2,134,407. When tested on real time video, 110 out of 120 images with expressions are recognized correctly. Figure 10 is the graphical representation of model accuracy and Figure 11 is the graphical representation of model loss. Figure 12 shows the real time face recognition result. Table 3 shows the comparison of available models.

As discussed above, the FER 2013 dataset is not balanced. A total of 35,887 images across 7 classes are present in this dataset, and the unbalanced dataset gave results that were very low, specifically for the disgust, fear, and sad emotions. So, a data balancing technique was used. The Keras API helps to enlarge the dataset by applying various techniques through the ImageDataGenerator function. This mainly includes five operations, i.e., rotation by a certain angle, shearing, zooming, rescaling, and horizontal flipping. Before data augmentation, a total of 35,887 images were used, out of which only 547 showed disgusted expressions; after data augmentation, a total of 41,904 images were used, of which 6564 were of disgusted faces. The confusion matrix after data augmentation is shown in Table 6.
It can be noticed that the predictions for disgust and fear have improved. The overall accuracy of the system after data augmentation rose by 2%, to 68%. Table 7 shows the comparison of the proposed edge device with previous studies. To evaluate the performance and effectiveness of the proposed edge device, we compared it with previous studies on facial emotion detection. Hardware implementations of facial emotion detection and recognition remain rare, and the few studies that have built such hardware recorded lower accuracies, such as 51.28% and 47.44%, compared with the proposed model at 68%.
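Two of the augmentation operations named above can be illustrated on a toy image as follows; the actual pipeline applies all five (rotation, shear, zoom, rescale, horizontal flip) via Keras's ImageDataGenerator:

```python
# Minimal illustration of two augmentations named above (horizontal flip
# and rescale) on a toy 2x3 grayscale "image". The real pipeline applies
# these, plus rotation, shear, and zoom, through Keras's ImageDataGenerator.

def horizontal_flip(img):
    # mirror each row left-to-right
    return [row[::-1] for row in img]

def rescale(img, factor=1.0 / 255):
    # map 0..255 pixel values into 0..1
    return [[p * factor for p in row] for row in img]

img = [[0, 128, 255],
       [255, 128, 0]]
flipped = horizontal_flip(img)
print(flipped[0])               # [255, 128, 0]
scaled = rescale(img)
print(round(scaled[0][2], 3))   # 1.0
```

Each transformed copy counts as a new training sample, which is how the disgust class grew from 547 to 6564 images.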

Conclusions
Real-time detection of any suspicious activity is difficult without actual interaction with the subject or suspect, and reading a person's face in real time is a challenging task. With the help of compact and portable devices, it becomes easy for most organizations to understand the behavior of their employees and resolve minor and major issues at an early stage. To achieve that, a framework has been proposed and tested that can be implemented in any organization to understand employee behavior. The proposed framework is a cost-effective and compact alternative to the heavy and bulky systems that are difficult to deploy in real time. The system was tested on 20 different people across all 7 emotions, and out of 120 images in total, 110 were identified with the correct emotion in real time. The proposed framework has been implemented using the Mini-Xception deep network because of its computational efficiency and short inference time compared with other networks.
Facial expression representation plays an important role in facial expression recognition. It can be viewed as generating good features for describing the appearance, structure, and motion of facial expressions. More specifically, facial expression features attempt to effectively describe the facial muscle or facial motion for static or dynamic facial images. Numerous works have already done this and, although different proposed methods for facial expression recognition have achieved good results, there remain different problems that need to be addressed by the research community. The most important one is face variability in a single person. There are many factors that can cause two pictures from the same person to look totally different, such as light, face expression, or occlusion. Another problem to be taken into account is the environment. Except in controlled scenarios, face pictures have very different backgrounds, which can make the problem of face recognition more difficult. To address this issue, many of the most successful systems focus on treating the face alone, discarding all the surroundings. Smart meeting, video conferencing, and visual surveillance are some of the real-world applications that require a facial expression recognition system that works adequately on low resolution images. There exist lots of methods for facial expression recognition but very few of those methods provide results or work adequately on low resolution images. More research effort is required to be put forth for recognizing more complex facial expressions than the six classical ones, such as fatigue, pain, and mental states such as agreeing, disagreeing, lying, frustration, thinking, as they have numerous application areas. 
Other problems include expression intensity estimation, spontaneous expression recognition, micro expression recognition (brief, involuntary facial expression, lasts only 1/25 to 1/15 of a second), mis-alignment problems, illumination, and face pose variation. Moreover, studies proved that visual captures of facial expressions alone are not sufficient to identify the exact human emotions discussed in this section. This research can be further carried out by combining FER systems with various physiological sensors to identify the exact mental state of a person.