IoMT Based Facial Emotion Recognition System Using Deep Convolution Neural Networks

Facial emotion recognition (FER) is the process of identifying human emotions from facial expressions. It is often difficult to identify the stress and anxiety levels of an individual from visuals captured through computer vision alone. However, technological advances in the Internet of Medical Things (IoMT) have yielded impressive results in gathering various forms of emotional and physical health-related data. Novel deep learning (DL) algorithms now make it possible to run applications in a resource-constrained edge environment, encouraging data from IoMT devices to be processed locally at the edge. This article presents an IoMT-based facial emotion detection and recognition system that has been implemented in real-time on a small, powerful, yet resource-constrained device, the Raspberry-Pi, with the assistance of deep convolutional neural networks. For this purpose, we conducted an empirical study on the facial emotions of human beings along with their emotional state as measured by physiological sensors. We then propose a model for the detection of emotions in real-time on a resource-constrained device, i.e., Raspberry-Pi, together with a co-processor, i.e., Intel Movidius NCS2. The facial emotion detection test accuracy ranged from 56% to 73% across the various models; the best accuracy of 73% on the FER 2013 dataset compares well with the state-of-the-art result, reported as 64% at most. A t-test is performed to extract the significant differences in the systolic blood pressure, diastolic blood pressure, and heart rate of an individual watching three different subjects (angry, happy, and neutral).


Introduction
The Internet of Medical Things (IoMT) is an emerging technology that is widely used in health care management applications for assisting patients in real-time scenarios [1,2]. IoMT is the amalgamation of smart sensors, wirelessly connected devices, and medical devices. At present, it is also used to monitor emotional, physiological, and vital states with the assistance of wearable devices and non-invasive off-the-shelf hardware [3]. In the modern era, people experience high levels of stress and anxiety owing to success and failure in their work, which in severe cases can lead to suicide. Stress and anxiety can be sensed through the facial emotion of an individual. Facial expressions carry more weight than words during a personal conversation. Researchers have recently recommended building robust and dedicated devices for distinguishing moods. A t-test is performed to extract the significant differences in the systolic blood pressure, diastolic blood pressure, and heart rate of an individual while watching three different subjects (angry, happy, and neutral).
The study is organized as follows: Section 2 reviews the related work. Section 3 covers the proposed methodology and the methods implemented in this study. Section 4 covers the system development, where the customized vision mote and wrist band are addressed. Section 5 covers the results and validation of the real-time experimental setup, where the results are obtained from distinct individuals in real-time, and the t-test validation is also discussed. Section 6 concludes the paper.

Related Work
Deep learning techniques can also be used to improve efficiency. For example, a synthetic data generation unit was designed to secretively generate faces with varying expressive saturations using a 3D convolutional neural network (CNN) [16,17]. Many descriptive approaches to the interaction forms of emotions are included in the classification of the input data, and the CNN is an effective deep learning algorithm for this purpose [18,19]. A CNN model was trained on different FER datasets to demonstrate the capacity of such networks to classify emotions both on the datasets and in actual FER tasks [20]. The first derives spatial existence from picture collections, while the second derives spatial structure from temporal facial landmarks. These two versions are fused using a modern convergence method to improve facial speech recognition efficiency. A system has also been described in which facial expressions are used to obtain the relevant content of a video stream, with features extracted using a histogram of oriented gradients (HOG) descriptor together with a uniform local ternary pattern (U-LTP) [21].
A face coding system has been shown to be effective at interpreting distinct emotions, both individually and simultaneously, through the contraction and relaxation of the facial muscles [22]. It treats several muscle movements as action units that can track expressions, and these units can be combined across different emotions to determine people's moods. The first and most important step in all FER systems is face detection, which is still a difficult task due to a variety of issues such as image compression artifacts, strong lighting, and low resolution [23]. The Contextual Multi-Scale Region-based Convolutional Neural Network (CMS-RCNN) method has been applied to detect faces even when they are inverted or at a very difficult angle. SIFT and CNN features have been combined to recognize facial expressions, and the authors conclude that the proposed system can be trained with fewer images while still obtaining high accuracy [24]. A human-robot interaction approach to facial expression recognition has been proposed together with a conditional generative adversarial network (cGAN), which was used to identify the expression through discriminative representations obtained in 3D [25].
A CNN-based mechanism for facial expression recognition has been proposed in which four forms of processing are carried out: a feature extraction module, a focus module, a rebuilt module, and a classification module [26]. A neural hybrid deep learning method is proposed especially for socially beneficial robots to recognize the emotions from facial [27,28]. Feature extraction inspired by human vision was introduced for both controlled and uncontrolled images, and a Gabor filter was used to reduce the computational cost and the length of large feature vectors [29]. Multiple facial features and support vector machines have been presented to enhance the efficiency of facial expression recognition; three feature types were employed, namely the discrete cosine transform (DCT), the angular radial transform (ART), and the Gabor filter (GF) [30].
A Face Expression Recognition Function Fusion Network (FFN-FER) has been proposed that focuses on a common (IC) channel within the category and an inter-category feature distinctions (ID) channel; the IC channel is used to find the common features and the ID channel to analyze the distinguishing features [31]. An attempt to define and recognize emotions in a few typical Hollywood film clips proposed a multi-layer cognitive framework, adopted the backpropagation (BP) algorithm to optimize and learn the network weights, and used spatial relationship projection to reduce the number of model parameters and boost training efficiency [32]. A system has been proposed for real-time monitoring of dustbins using image processing and Raspberry-Pi to prevent waste from overflowing the bins [33]. An approach has been presented for real-time facial landmark detection and feature extraction, the most critical prerequisite for emotion recognition systems, on Raspberry-Pi [34]. An image and video capturing-based system has been proposed for monitoring sanitation on the premises of a hospital through Raspberry-Pi and Arduino (Unique India, Delhi, India) [35]. To recognize emotion from speech data where memory and computing requirements must be limited, three state-of-the-art feature selection methods, namely ILFS, ReliefF, and Fisher, were examined and compared with the proposed 'Active Feature Selection' (AFS) method [36,37]. A new emotion recognition approach based on robust features and machine learning from audio speech has also been proposed, in which Mel Frequency Cepstrum Coefficients (MFCCs) were computed from the input audio data as features for a person-independent emotion recognition system [38].
The factors associated with facial emotion, and the corresponding tests to validate these factors, have been investigated. Furthermore, some researchers have tried to capture body characteristics using ultrasonic and radio frequency signals or cell phones, with the emotion recognition algorithms differing according to the acquisition device [32,39]. In [40], a multi-modal visualization analysis (MMVA) method is introduced to reduce the processing complexity of automated human-machine interactions (HMI) in health monitoring. The method is designed particularly to identify a patient's facial expressions from facial expression and texture inputs. A WiFi-based method of facial expression recognition called WiFE has been introduced; its main insight is that, for different expressions, facial muscle activity produces distinctive waveform patterns in the channel state information (CSI) time series of Wireless Local Area Network (WLAN) signals [41]. A deep tree-based model has been proposed in a cloud environment for automatic facial recognition, and the model is computationally less costly without affecting its accuracy [42]. Traditional face recognition methods fail to predict the exact facial characteristics, which reduces recognition accuracy, and methods that construct facial points increase the computational complexity. To address this, an artificial intelligence and Internet-based facial expression detection system has been used to predict and match faces against a database accurately [43,44].
From the above literature, it is evident that computer vision devices need to be correlated with the sensor data of an individual to implement an effective facial emotion recognition system. To do that, the human facial emotion is first captured by Raspberry-Pi with a timestamp. At the same time, the physiological sensor values are recorded for the same subject. The captured expression is then validated against the physiological values recorded by the device itself. If both values are found to be statistically significant, then the expression is considered a valid expression.

Proposed Methodology
In this method, facial emotion recognition is conducted on Raspberry-Pi itself instead of in the cloud. The pre-processing step is the same as before. The deep network is trained and then imported onto Raspberry-Pi, and inference is sped up by a co-processor, i.e., Intel Movidius Neural Compute Stick 2. The detailed architecture of this method is shown in Figure 2.
Step 1. Selection of Deep Convolutional Neural Network: Working with a resource-constrained device, such as Raspberry-Pi, calls for architectures that require less power, occupy less space, and process quickly. VGG and ResNet are therefore unsuitable, as they require 200-500 MB, which is too large for resource-constrained devices given their limited storage and computing capability. Architectures such as MobileNets are a better fit; they differ from conventional CNNs in that they use depth-wise separable convolution. Accordingly, the Mini_Xception model [45] has been trained on the FER 2013 dataset (https://www.kaggle.com/msambare/fer2013, accessed on 5 July 2018), as this network splits the convolution into two stages. The first stage is a 3 × 3 depth-wise convolution, and the second stage performs a 1 × 1 pointwise convolution. This split is the main factor that reduces the number of parameters in the network. The trade-off is accuracy, because these networks are not as accurate as conventional CNNs.
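The parameter saving from splitting a convolution into a depth-wise and a pointwise stage can be illustrated with a short calculation. This is a minimal sketch, not the Mini_Xception implementation: the function names and the layer sizes (3 × 3 kernel, 64 input channels, 128 output channels) are assumptions chosen only for illustration, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a k x k depth-wise stage followed by a 1 x 1 pointwise stage."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, used only to illustrate the saving.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.2f}")
```

For these assumed sizes the separable layer needs 8768 weights against 73,728 for the standard layer, roughly an eightfold reduction, and the saving grows with the number of output channels.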
Step 2. Training of pre-trained network and importing: OpenCV 3.3, released in 2017, included a highly improved deep neural network module. This module supports several frameworks, including Caffe, TensorFlow, and PyTorch/Torch. The Caffe module has been used here to import the network onto Raspberry-Pi. The network has been trained on the FER 2013 dataset using Google Co-Lab with a K80 GPU (NVIDIA, Santa Clara, CA, USA). Once the network has been trained, the Prototxt file, which defines the model and all of its layers, and the Caffe model file, which contains the weights of the actual layers, are imported, and the command-line arguments are parsed. The model is then loaded from the Prototxt and model file paths and stored as a net.
Step 3. Feeding the pre-processed image: The next step is to feed the pre-processed image into the network. The pre-processing stages are already explained in the first method. Pre-processing includes setting the blob dimensions and normalization.
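Loading a Caffe model and feeding it a pre-processed blob, as described above, could look roughly like the following sketch. It uses OpenCV's `cv2.dnn.readNetFromCaffe` and `cv2.dnn.blobFromImage`; everything else is an assumption rather than the authors' code: the helper names, the 48 × 48 input size (the native FER 2013 image size), the 1/255 scaling, and the output order of the seven FER 2013 classes would all have to match the network that was actually trained.

```python
# FER 2013 class names; the order is assumed to match the trained network's output.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def top_emotion(scores, labels=EMOTIONS):
    """Return (label, confidence) for the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], float(scores[best])


def load_net(prototxt_path, caffemodel_path):
    """Load the network definition and weights once; both paths are hypothetical."""
    import cv2  # imported lazily so top_emotion works without OpenCV installed
    return cv2.dnn.readNetFromCaffe(prototxt_path, caffemodel_path)


def classify_face(net, face_gray, size=(48, 48)):
    """Run one grayscale face crop through the loaded network."""
    import cv2
    # blobFromImage resizes to `size`, scales pixel values to [0, 1],
    # and returns the data in NCHW layout.
    blob = cv2.dnn.blobFromImage(face_gray, scalefactor=1.0 / 255, size=size)
    net.setInput(blob)
    scores = net.forward().flatten()
    return top_emotion(scores.tolist())


# Example with made-up scores, only to show the shape of the result.
print(top_emotion([0.05, 0.0, 0.05, 0.7, 0.1, 0.05, 0.05]))
```

Loading the net once and reusing it for every frame avoids re-parsing the model files on each prediction, which matters on a device with limited resources.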

System Development
This section explains the detailed steps of the hardware development of the system, which is capable of detecting the facial emotions of human beings in real-time. The hardware development is divided into two parts. The first part is the vision node, which has a camera, Raspberry-Pi, servo motors to provide pan and tilt, and a co-processor that runs the deep network. The other part of the system is a wrist band, which has two sensors, i.e., the Heart Rate and BP sensors (Sunrom, Ahmedabad, India), to record the physiological values of the person so that they can be correlated with the facial expression. The expressions recorded with this device were verified against the physiological values, and the two were found to be closely related.
The main parts of the vision node are Raspberry-Pi and the Pi camera, as shown in Figure 3. In the vision node, servo motors are used to pan and tilt the device to monitor a face in real-time; moreover, this node contains both an RF modem and Wi-Fi to store the data in the cloud. The RF modem is used to collect the wearable band data and to transfer the information to the server. The LCD is used to display the values captured from the wristband. The wrist band has physiological sensors on it, i.e., a BP sensor and a Heart Rate sensor. The sensor values are received by Raspberry-Pi using Pyfirmata and combined with the facial emotion images that are captured via the Pi camera in real-time. The data gathered from both the camera and the sensors are then correlated, and a conclusive expression and confidence are extracted to understand the facial emotion of the person in real-time. Figure 4 presents the customized vision mote with the complete package.
Figure 5a illustrates the block diagram of the wrist band, which collects the physiological values of the person via two different sensors, i.e., the BP sensor and the Heart Rate sensor. The BP sensor detects the systolic and diastolic blood pressure values and displays them on the LCD as well as in the cloud. The Heart Rate sensor senses the heart rate, displays it on the LCD, and sends all the values to the cloud as well. The physiological sensor values are collected along with the real-time emotion-detected images, and conclusions are drawn on Raspberry-Pi itself by comparing the recorded values against the thresholds.
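The combination of timestamped camera predictions with wristband readings described above can be sketched as a nearest-timestamp join. This is only an illustration under stated assumptions: the function names, the record layout, the 2-second tolerance, and all sample values are hypothetical and are not measurements from the study.

```python
from bisect import bisect_left


def nearest_reading(readings, t, max_gap=2.0):
    """Return the sensor reading closest in time to t, or None if none is close enough.

    readings: list of (timestamp_seconds, values_dict), sorted by timestamp.
    """
    if not readings:
        return None
    times = [ts for ts, _ in readings]
    i = bisect_left(times, t)
    # Only the neighbours just before and just after t can be the closest.
    candidates = readings[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda r: abs(r[0] - t))
    return best if abs(best[0] - t) <= max_gap else None


def pair_records(emotions, readings, max_gap=2.0):
    """Attach the nearest sensor reading to each (time, label, confidence) record."""
    pairs = []
    for t, label, confidence in emotions:
        match = nearest_reading(readings, t, max_gap)
        if match is not None:
            pairs.append({"time": t, "emotion": label,
                          "confidence": confidence, **match[1]})
    return pairs


# Placeholder values for illustration only.
emotions = [(10.0, "happy", 0.81), (20.0, "angry", 0.64)]
readings = [
    (9.5, {"heart_rate": 72, "systolic": 118, "diastolic": 79}),
    (15.0, {"heart_rate": 75, "systolic": 120, "diastolic": 80}),
    (26.0, {"heart_rate": 90, "systolic": 135, "diastolic": 88}),
]
pairs = pair_records(emotions, readings)
print(pairs)
```

Predictions with no sensor reading inside the tolerance window are dropped rather than paired with stale data; whether that is the right policy depends on how often the wristband actually reports.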
Electronics 2021, 10, x FOR PEER REVIEW 7 of 23
The wrist band includes an RF modem that can transmit to an RF modem in another part of the room. Moreover, the other node also has a Wi-Fi module that uploads the recorded values to the cloud. The band is powered by a LiPo battery, which gives it a long backup time. The battery is rechargeable and requires very little power to recharge. The vision mote, shown in Figure 6, displays the real-time values of BP and heart rate and uploads them to the cloud via a Wi-Fi module.
The complete system has been designed with customized boards. The bit map of the RF modem, which is part of the vision mote, is shown in Figure 7a. The bit map of the wristband, which also has its own RF modem, is shown in Figure 7b. The threshold values used to detect the criticality of the situation are shown in Table 1.
The two-dimensional model based on valence and arousal is shown in Figure 8. The model explains the 4 basic emotional states and the corresponding primary and tertiary emotions. Figure 9 shows the experimental setup established to capture the real-time facial emotion of the subjects along with their physiological values, which include heart rate and blood pressure. This experimental setup includes a wrist band with physiological sensors, such as heart rate and BP, and a vision node as well.
The vision node captures the facial emotions in real-time as the subjects watch videos intended to take them to various emotional states, and sufficient time was given to all the subjects to carry out this work properly. It generally takes time to switch from one emotional state to another. Therefore, proper care has been taken in that respect while conducting this experiment.

Results
This section gives a detailed description of all the experiments and explains the performance of the various models. The experiments presented in this section were run on Google Co-Lab with a 12 GB NVIDIA Tesla K80 GPU (NVIDIA, Leeds, UK). Another model that has been used for training is the Mini_Xception model, a modified depth-wise separable convolutional neural network. Compared with a conventional convolutional neural network, this model does not need to perform convolution across all the channels. This makes the model lighter and leaves it with far fewer connections than conventional models. The architecture of the Mini_Xception model is shown in Figure 10.
The major benefit of this architecture is that it does not contain any fully connected layers, and the inclusion of depth-wise separable convolutions helps to reduce the number of parameters. The introduction of residual modules also enables the gradients to propagate better to the lower layers during backpropagation. The network is trained on Google Co-Lab in batch mode, using the SGD and Adam optimizers separately, and achieves an accuracy of 69% after 35 epochs with the Adam optimizer. The highest accuracy was obtained with Adam, as SGD is more locally unstable. Besides the Mini_Xception and Mobilenet_V2 models, the dataset has also been trained on Densenet161 and the ResNet model. The results for these two models are considerably lower than those of the previous two models. Therefore, these two models are not considered further.
Figure 11 shows the training loss. The graph shows that the loss decreases approximately exponentially and reaches its minimum by the 35th epoch. The confusion matrix shows that disgusted faces are misclassified as angry faces, because the dataset contains fewer disgusted faces than faces of any other class. The main cause of this misclassification is therefore the non-uniform class distribution of the FER 2013 dataset. The model accuracy shown in Figure 11 reaches a training accuracy of 73%. The accuracy achieved in 35 epochs is high enough for the model to be considered for deployment on the system for real-time facial emotion detection.
The setup has been used to detect facial emotions in real-time and to validate them via physiological sensors. The wrist band is designed to record physiological values, such as the heart rate and blood pressure of the subject, under various situations. The source of empirical data in this experiment is the facial emotions of the subjects, recorded with a timestamp, together with the data gathered simultaneously via the physiological sensors.
The subjects watch videos that help them enter various emotional states, and their heart rate and blood pressure are captured at the same time as their facial expressions. The timestamped sensor values and the timestamped facial images are then used to validate the expression recorded by the system. The system has been designed to detect emotions with two tiers of validation: first via facial images alone, and then, for further validation of the extracted emotions, via physiological sensors. Table 3 describes the videos used to carry out the experiment with different subjects. The proposed system is designed and tested for edge devices, and the prior literature has found that real implementation on a resource-constrained device such as Raspberry-Pi is challenging [9]. Therefore, only four expressions have been recorded and validated initially. In the future, more powerful embedded boards can be used to increase the efficiency of the system. The videos that can bring a person to a happy, sad, or angry state are listed in Table 3. Radar plots for the variation in blood pressure under happy, angry, and neutral states are illustrated in Figure 12. The experiment has been recorded for 20 different people, but for validation purposes, only three subjects with five observations each under four emotional states have been used, as shown in Table 4.
Table 4 shows the recorded values of these three subjects for the basic emotional states of anger, neutral, happy, and sad. The values were recorded in an experimental environment in which each subject sat wearing a wrist band fitted with a heart rate sensor and a blood pressure sensor. Once the subject was wearing the wrist band, the Raspberry-Pi with a Pi camera started to capture the expression of the person in real-time along with the physiological values. Figure 13 illustrates the captured facial expressions with a time stamp.
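The pairing of timestamped facial expressions with timestamped sensor values can be sketched as a nearest-timestamp match. This is an illustrative sketch only, not the algorithm used in the experiment: the function names, the log layout, the 2-second tolerance, and all sample values are assumptions introduced here.

```python
from bisect import bisect_left


def nearest_reading(sensor_log, t, max_gap=2.0):
    """Return the sensor reading closest in time to t, or None if too far.

    sensor_log is a list of (timestamp_seconds, reading) sorted by timestamp.
    """
    times = [ts for ts, _ in sensor_log]
    i = bisect_left(times, t)
    # Only the readings just before and just after t can be the nearest.
    candidates = sensor_log[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    ts, reading = min(candidates, key=lambda item: abs(item[0] - t))
    return reading if abs(ts - t) <= max_gap else None


def synchronize(emotion_log, sensor_log, max_gap=2.0):
    """Attach the nearest physiological reading to each detected emotion."""
    paired = []
    for t, emotion in emotion_log:
        reading = nearest_reading(sensor_log, t, max_gap)
        if reading is not None:
            paired.append((t, emotion, reading))
    return paired


# Hypothetical logs: timestamps in seconds,
# readings as (systolic, diastolic, heart_rate).
emotions = [(10.2, "happy"), (15.1, "happy"), (40.0, "neutral")]
sensors = [(10.0, (132, 98, 86)), (15.0, (134, 100, 85)), (20.0, (133, 99, 84))]
pairs = synchronize(emotions, sensors)
for t, emotion, reading in pairs:
    print(t, emotion, reading)
```

Detections with no sensor reading within the tolerance are dropped rather than matched to a stale value, so that each retained pair reflects the same moment.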
A time-synchronized algorithm has been used to capture the facial expression and the physiological values of the participants, i.e., heart rate and blood pressure. To validate the results, various analyses have been conducted. The literature reports that emotional arousal increases systolic and diastolic blood pressure, and that happiness, anger, and anxiety increase blood pressure to a degree that depends on the individual. To visualize the physiological values, box plots have been plotted.
Figure 14a shows the box plot of the systolic blood pressure of the participants for all four expressions. The box plot shows that sadness is associated with the lowest systolic blood pressure, whereas anger and happiness tend to increase it. The first and third quartiles for each expression are also shown; 25% and 75% of the values lie below them, respectively. The median of the recorded values is labeled on the box plot of each expression and indicates the center of the distribution of the systolic values for that expression.
Figure 14b shows the box plot of the diastolic blood pressure of the participants for all four expressions. The box plot shows that sadness is associated with the lowest diastolic blood pressure, whereas anger and happiness tend to increase it. The first and third quartiles for each expression are also shown; 25% and 75% of the values lie below them, respectively. The median of the recorded values is labeled on the box plot of each expression and indicates the center of the distribution of the diastolic values for that expression. One outlier is depicted: the fourth recorded value of subject 1 in Table 4, i.e., 114, which is high compared to the other recorded values.
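The quartiles and outliers drawn in a box plot can be computed directly. The sketch below assumes the common 1.5 x IQR (Tukey) rule for flagging outliers and the "inclusive" quartile method of Python's `statistics` module; the plotting tool used for Figure 14 may follow a different quartile convention. The sample readings are hypothetical and were chosen so that a single value of 114 stands out, as in the outlier described above.

```python
import statistics


def box_plot_summary(values):
    """Return (q1, median, q3, outliers) using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, median, q3, outliers


# Hypothetical blood pressure readings (mmHg), not the values of Table 4.
readings = [70, 72, 73, 74, 75, 76, 78, 114]
q1, median, q3, outliers = box_plot_summary(readings)
print(f"Q1={q1}, median={median}, Q3={q3}, outliers={outliers}")
```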
Figure 14c shows the box plot of the heart rate variation of the participants for all four expressions. The box plot shows that anger raises the heart rate of the participants to the maximum level, while the neutral state shows the minimum. The first and third quartiles for each expression are also shown; 25% and 75% of the values lie below them, respectively. The median of the recorded values is labeled on the box plot of each expression and indicates the center of the distribution of the heart rate values for that state. Two outliers are depicted for the neutral state, namely the 26th and 27th values in Table 4. Both values are the same, i.e., 61, and are low compared to the other recorded values.
In order to validate the variation of the recorded physiological values across mental states, a paired-sample t-test has been applied. A paired-sample t-test is a statistical procedure used to determine whether the mean difference between two sets of paired observations is zero. Therefore, this test has been used to validate the variation in the experimentally recorded values.
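The computation behind a paired-sample t-test can be sketched in a few lines. The readings below are hypothetical and are not taken from Table 4; the function name is also introduced here for illustration. The p-value would then be read from a t distribution with the returned degrees of freedom, for example with a statistics package.

```python
import math
import statistics


def paired_t_test(first, second):
    """Return (t statistic, degrees of freedom) for two paired samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical systolic readings (mmHg) for the same five observations.
happy = [130, 135, 128, 140, 132]
neutral = [115, 118, 112, 120, 110]
t_stat, dof = paired_t_test(happy, neutral)
print(f"t({dof}) = {t_stat:.3f}")
```

A positive t statistic here means that the first condition has the higher mean, which corresponds to a decrease when moving from the first state to the second.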

a. Paired-Sample t-test Analysis between Happy and Neutral States
H1 (Alternate Hypothesis) = There is a significant decrease in the systolic blood pressure of participants when their emotional state changes from happy to neutral.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
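For reference, the standard paired-sample formulation can be written as follows. This is the textbook form and is assumed, not confirmed, to be the one intended here; the symbols \bar{d}, s_d, and n (the mean of the paired differences, their sample standard deviation, and the number of pairs) are introduced for this sketch, and the two-sided alternative is likewise an assumption.

```latex
H_0:\ \mu_1 - \mu_2 = \partial_0
\qquad
H_1:\ \mu_1 - \mu_2 \neq \partial_0

t = \frac{\bar{d} - \partial_0}{s_d / \sqrt{n}},
\qquad
df = n - 1
```

With \partial_0 = 0, the sign of t indicates the direction of the change. If the 15 paired observations per state in Table 4 (three subjects with five observations each) are the ones tested, then n = 15 gives 14 degrees of freedom, which is consistent with the t(14) values reported in this section.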
A paired-sample t-test was conducted to compare the systolic blood pressure of participants while watching happy and neutral videos, using the experimental setup for 20 participants. There was a significant difference between the systolic blood pressure while watching happy videos (M = 133.333, SD = 7.7429) and while watching neutral videos (M = 114.400, SD = 7.3853); t(14) = 5.157, p < 0.001, as shown in Table 5. The systolic blood pressure decreased significantly when participants watched neutral videos after watching happy videos. Hence, there is enough evidence that the mean difference in the systolic blood pressure of participants is statistically significant when their emotional state changes from happy to neutral, and the hypothesis of a significant decrease is accepted.
H1 (Alternate Hypothesis) = There is a significant decrease in the diastolic blood pressure of participants when their emotional state changes from happy to neutral.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the diastolic blood pressure of participants while watching happy and neutral videos, using the experimental setup for 20 participants. There was a significant difference between the diastolic blood pressure while watching happy videos (M = 99.400, SD = 7.0791) and while watching neutral videos (M = 73.933, SD = 7.3918); t(14) = 12.222, p < 0.001, as shown in Table 6. The diastolic blood pressure decreased significantly when participants watched neutral videos after watching happy videos. Hence, there is enough evidence that the mean difference in the diastolic blood pressure of participants is statistically significant when their emotional state changes from happy to neutral, and the hypothesis of a significant decrease is accepted.
H1 (Alternate Hypothesis) = There is a significant decrease in the heart rate of participants when their emotional state changes from happy to neutral.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the heart rate of participants while watching happy and neutral videos, using the experimental setup for 20 participants. There was a significant difference between the heart rate while watching happy videos (M = 85.733, SD = 8.9400) and while watching neutral videos (M = 73.267, SD = 5.4178); t(14) = 5.983, p < 0.001, as shown in Table 7. The heart rate decreased significantly when participants watched neutral videos after watching happy videos. Hence, there is enough evidence that the mean difference in the heart rate of participants is statistically significant when their emotional state changes from happy to neutral, and the hypothesis of a significant decrease is accepted.

b. Paired-Sample t-test Analysis between Neutral and Angry States
H1 (Alternate Hypothesis) = There is a significant increase in the systolic blood pressure of participants when their emotional state changes from neutral to angry.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the systolic blood pressure of participants while watching neutral and angry videos, using the experimental setup for 20 participants. There was a significant difference between the systolic blood pressure while watching neutral videos (M = 114.400, SD = 7.3853) and while watching angry videos (M = 136.533, SD = 4.4379); t(14) = −8.137, p < 0.001, as shown in Table 8. The systolic blood pressure increased significantly when participants watched angry videos after watching neutral videos. Hence, there is enough evidence that the mean difference in the systolic blood pressure of participants is statistically significant when their emotional state changes from neutral to angry, and the hypothesis of a significant increase is accepted.
H1 (Alternate Hypothesis) = There is a significant increase in the diastolic blood pressure of participants when their emotional state changes from neutral to angry.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the diastolic blood pressure of participants while watching neutral and angry videos, using the experimental setup for 20 participants. There was a significant difference between the diastolic blood pressure while watching neutral videos (M = 73.933, SD = 7.3918) and while watching angry videos (M = 110.200, SD = 1.4736); t(14) = −17.413, p < 0.001, as shown in Table 9. The diastolic blood pressure increased significantly when participants watched angry videos after watching neutral videos. Hence, there is enough evidence that the mean difference in the diastolic blood pressure of participants is statistically significant when their emotional state changes from neutral to angry, and the hypothesis of a significant increase is accepted.
H1 (Alternate Hypothesis) = There is a significant increase in the heart rate of participants when their emotional state changes from neutral to angry.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the heart rate of participants while watching neutral and angry videos, using the experimental setup for 20 participants. There was a significant difference between the heart rate while watching neutral videos (M = 73.267, SD = 5.4178) and while watching angry videos (M = 98.067, SD = 9.6471); t(14) = −7.170, p < 0.001, as shown in Table 10. The heart rate increased significantly when participants watched angry videos after watching neutral videos. Hence, there is enough evidence that the mean difference in the heart rate of participants is statistically significant when their emotional state changes from neutral to angry, and the hypothesis of a significant increase is accepted.

c. Paired-Sample t-test Analysis between Angry and Sad States
H1 (Alternate Hypothesis) = There is a significant decrease in the systolic blood pressure of participants when their emotional state changes from angry to sad.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the systolic blood pressure of participants while watching angry and sad videos, using the experimental setup for 20 participants. There was a significant difference between the systolic blood pressure while watching angry videos (M = 136.533, SD = 4.4379) and while watching sad videos (M = 90.200, SD = 2.7568); t(14) = 31.535, p < 0.001, as shown in Table 11. The systolic blood pressure decreased significantly when participants watched sad videos after watching angry videos. Hence, there is enough evidence that the mean difference in the systolic blood pressure of participants is statistically significant when their emotional state changes from angry to sad, and the hypothesis of a significant decrease is accepted.
H1 (Alternate Hypothesis) = There is a significant decrease in the diastolic blood pressure of participants when their emotional state changes from angry to sad.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the diastolic blood pressure of participants while watching angry and sad videos, using the experimental setup for 20 participants. There was a significant difference between the diastolic blood pressure while watching angry videos (M = 110.200, SD = 1.4736) and while watching sad videos (M = 76.53, SD = 3.563); t(14) = 33.722, p < 0.001, as shown in Table 12. The diastolic blood pressure decreased significantly when participants watched sad videos after watching angry videos. Hence, there is enough evidence that the mean difference in the diastolic blood pressure of participants is statistically significant when their emotional state changes from angry to sad, and the hypothesis of a significant decrease is accepted.
H0 (Null Hypothesis) = There is no significant difference in the heart rate of the participants when their emotional state changes from angry to sad.
H1 (Alternate Hypothesis) = There is a significant difference in the heart rate of the participants when their emotional state changes from angry to sad.
where µ1 − µ2 is the difference between the hypothesized means, and ∂0 is the hypothesized difference.
A paired-sample t-test was conducted to compare the heart rate of participants while watching angry and sad videos, using the experimental setup for 20 participants. There was no significant difference between the heart rate while watching angry videos (M = 98.067, SD = 9.6471) and while watching sad videos (M = 96.067, SD = 5.8367); t(14) = 0.587, p = 0.566, as shown in Table 13. Hence, there is not enough evidence that the mean difference in the heart rate of the participants is statistically significant when their emotional state changes from angry to sad, and the null hypothesis of no significant difference cannot be rejected.
Table 14 presents the complete values for the three pairs of states (happy to neutral, neutral to angry, and angry to sad) with respect to the three parameters, namely systolic BP, diastolic BP, and heart rate. Table 14 also shows the close correspondence between the data captured via the physiological sensors and the expression captured via the device. The experiment provides enough evidence that the variation of the parameters between paired states is statistically significant, with the exception of the heart rate between the angry and sad states.
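The nine paired comparisons reported in this section can be collected and checked against a critical value. The sketch below uses only the t statistics stated in the text; the two-tailed critical value of 2.145 for df = 14 at the 0.05 level comes from a standard t table, and the choice of a two-tailed test at that level is an assumption made for illustration.

```python
# t statistics reported in the text for each pair of states (df = 14).
reported_t = {
    ("happy -> neutral", "systolic BP"): 5.157,
    ("happy -> neutral", "diastolic BP"): 12.222,
    ("happy -> neutral", "heart rate"): 5.983,
    ("neutral -> angry", "systolic BP"): -8.137,
    ("neutral -> angry", "diastolic BP"): -17.413,
    ("neutral -> angry", "heart rate"): -7.170,
    ("angry -> sad", "systolic BP"): 31.535,
    ("angry -> sad", "diastolic BP"): 33.722,
    ("angry -> sad", "heart rate"): 0.587,
}

# Assumed decision rule: two-tailed test, alpha = 0.05, df = 14.
T_CRITICAL = 2.145


def is_significant(t, critical=T_CRITICAL):
    """Return True when |t| exceeds the critical value."""
    return abs(t) > critical


summary = {key: is_significant(t) for key, t in reported_t.items()}
for (pair, parameter), significant in summary.items():
    label = "significant" if significant else "not significant"
    print(f"{pair:17s} {parameter:13s} t = {reported_t[(pair, parameter)]:8.3f}  {label}")
```

Under this rule, eight of the nine comparisons are significant, and only the heart rate between the angry and sad states is not, in agreement with the p-values reported above.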

Conclusions
In this article, we have designed and implemented an IoMT-based portable FER edge device to recognize the facial emotion of an individual. FER is achieved by interfacing the systolic, diastolic, and heart rate sensor data of an individual with the visuals captured through an edge device. The edge device is integrated with the Intel Movidius Neural Compute Stick 2 (NCS2), and a deep convolutional neural network implemented on the NCS2 enables the edge device to recognize facial emotion accurately. The facial emotion detection test accuracy ranged from 56% to 73% across the models, and the best accuracy of 73% on the FER 2013 dataset compares well with the state-of-the-art result of 64% maximum. Finally, a t-test validation was conducted to identify the significant differences in the systolic blood pressure, diastolic blood pressure, and heart rate of an individual while watching video clips on different subjects.
The primary goal of this work is to develop a system that can replace the existing bulky, wired, and system-dependent setups, which make face and facial emotion detection almost impossible while people are walking on roads or in airports, hospitals, and public places. Such systems are also expensive to use. Moreover, the literature review shows that studies have been carried out on the various techniques required for facial emotion recognition, but no literature has been found on designing and implementing a portable, cheap, and efficient device that works in real-time. This paper suggests the development of intelligence that can detect human faces and their emotions in real-time: a smart system and an IoT-based vision mote device designed for detecting the real-time behavior of people under different situations. It is a small contribution to the social cause. The device automatically detects a human presence and captures the human face along with its facial emotions. It thus collects real-time data and uploads the captured emotions to the cloud, where they can be accessed remotely.
The main aim of this work is to develop a system that can understand human emotions at any point in time, irrespective of age, gender, and race. Moreover, the system has been made compact and cost-effective compared with the existing heavy, costly, and complex facial emotion detection systems. In the future, the system can be implemented with more powerful embedded boards that are available in the market, such as NVIDIA's Jetson Nano (NVIDIA, Leeds, UK) and Google Coral's Dev Board (Coral, Tuscaloosa, AL, USA). These boards may increase the cost a little, but they can make the system more efficient and capable of handling more complex deep neural networks. To make the system maintenance-free, solar batteries are also suggested. In this work, only those deep networks that can be easily deployed on the resource-constrained Raspberry-Pi have been optimized, and an accuracy of 73% has been achieved; in the future, with more powerful embedded boards, various deep learning models can be used with better efficiency. The accuracy achieved with the proposed system is sufficiently good, as the system can measure the physiological parameters of a human being via a wrist band while detecting facial emotions in real-time.