Masked Face Emotion Recognition Based on Facial Landmarks and Deep Learning Approaches for Visually Impaired People

Current artificial intelligence systems for determining a person's emotions rely heavily on lip and mouth movement and other facial features such as the eyebrows, eyes, and forehead. Furthermore, low-light images are typically classified incorrectly because of the dark region around the eyes and eyebrows. In this work, we propose a facial emotion recognition method for masked facial images that uses low-light image enhancement and analyzes the upper features of the face with a convolutional neural network. The proposed approach employs the AffectNet image dataset, which includes eight types of facial expressions and 420,299 images. First, the lower part of the input facial image is covered with a synthetic mask. Boundary and regional representation methods are used to indicate the head and the upper features of the face. Second, we adopt a feature extraction strategy based on facial landmark detection, applied to the features of the partially covered masked face. Finally, the extracted features, the coordinates of the identified landmarks, and the histograms of oriented gradients are incorporated into the classification procedure using a convolutional neural network. An experimental evaluation shows that the proposed method surpasses others by achieving an accuracy of 69.3% on the AffectNet dataset.


Introduction
Understanding and responding to others' emotions is crucial to interpreting nonverbal cues and reading another person's emotions, thoughts, and intentions. Humans use a variety of cues, including voice intonation, word choice, and facial expression, to interpret emotional states. Non-verbal cues, such as facial expressions, are essential in communication, but people who are blind or visually impaired are unable to perceive these cues [1]. Accurate emotion recognition is particularly important in social interactions because of its function in helping people communicate more effectively. For instance, how people react to their interactions with individuals is affected by the emotions they are experiencing. The inferential processes that are triggered by an emotional expression might then inform subsequent thoughts and behaviors, as proposed by the emotion-as-social-information paradigm [2]. If the observer notices that the person being observed is distressed because they cannot open the door, they may offer to assist. Recognizing people's emotions correctly is critical because each emotion conveys unique information and feelings. If people in a meeting cannot recognize each other's emotions, they may respond counterproductively to one another.
Emotional mimicry, or the act of mirroring the nonverbal behaviors underlying an individual's emotional expressions [3,4], has been shown to boost a person's likeability and, consequently, the likelihood that others will like them and be willing to form a relationship with them.

• The facial emotion recognition part of the smart glasses design was implemented to assist blind and visually impaired (BVI) people in understanding and communicating with other people. Current smart glasses designs do not provide facial emotion recognition in low-light, noisy environments; they use real-time audio results to inform users about their direct surroundings [22];
• We used a low-light image enhancement technique to solve the problem of misclassification in scenarios where the upper parts of the face are too dark or the contrast is low;
• To recognize facial emotion, specific facial landmark modalities employ the MediaPipe face mesh method [26]. The results indicate a dual role in facial emotion identification: the model can identify emotional states in either masked or unmasked faces;
• We created a CNN model with feature extraction, fully connected, and SoftMax classification layers. The Mish activation function was adopted in each convolution layer. The use of Mish is a significant development that has the potential to enhance categorization precision.
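As context for the last point, the Mish activation itself can be sketched in a few lines; this snippet illustrates only the function, not our exact layer configuration:

```python
import math

def softplus(x: float) -> float:
    # numerically stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # Mish: x * tanh(softplus(x)) -- smooth and non-monotonic;
    # unlike ReLU, it lets small negative values pass through
    return x * math.tanh(softplus(x))
```

For large positive inputs Mish behaves like the identity, while negative inputs decay smoothly toward zero instead of being clipped, which is the property credited with improving gradient flow.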
This paper's remaining sections are structured as follows. We review existing facial emotion recognition methods in Section 2. Section 3 outlines the data collection and modification and describes the proposed masked facial emotion recognition approach. The experimental results and analysis are presented and discussed in Section 4. In Section 5, we discuss the limitations and outcomes of our study and suggest future directions.

Upper and Lower Parts of the Face
Based on previous research showing that facial expression recognition is decreased when a portion of the face is unobservable [27], it stands to reason that emotion recognition is affected by masking the face with various face coverings. Various current works [28][29][30][31][32] have shown this effect. Although visual manipulation makes it possible to standardize emotional indication between the mask and no-mask scenarios [30], it can introduce input artefacts that could interfere with emotion recognition. In a big smile, for instance, the top portion of a mask may rise, and the widening of the lips in surprise may expand it vertically; these changes in the lower half of the face's features can be seen as aspects that help convey mood. Furthermore, a person's facial expression of emotion may shift when they cover their face. Photo editing to artificially add a face covering can skew results and prevent a naturalistic study of the effects of masks on facial expression identification.
Recent studies that have examined masks and facial emotion recognition have found that wearing a mask reduces the accuracy of emotion recognition. However, this decrease in accuracy is not uniform across all facial expressions. For instance, recognition deficits were found for Happiness, Sadness, Disgust, and Anger, but not for Fear or Neutral emotions [29,30]. First, covering the lower features of the face, such as the mouth, cheeks, and nose, with masks has different effects on different facial expressions, as shown experimentally using "bubbles" in [33]. In addition, other approaches imply that the primary informative sections of the face vary between facial expressions [34,35]. In contrast, analyses of masking the face have shown differences across expressions in the outcomes of hiding the eye versus the mouth parts [36,37]. Figure 1 shows example images of seven facial emotions. Studies based on bubbles have shown that the lower parts of the face provide the most details about a person's emotional condition when they are happy, surprised, or disgusted. The upper parts of the face provide the most details when the person is afraid or angry, and the lower and upper parts provide the same information when the person is sad or neutral [34,35]. The most uniform effect from comparing coverage of the lower and upper regions of the face is that covering the lower part disrupts recognition of happiness more than covering the upper part.
At the same time, other emotions show varying results: for instance, the authors of [38] observed that covering the mouth disrupted recognition of disgust and anger more than covering the eyes; however, the author of [37] found the reverse trend.

Facial Landmarks
Facial landmarks present valuable data for exploring facial expressions, as shown in Figure 2. Yan et al. [39] proposed facial landmarks as action unit derivatives to describe facial muscle motions. Other studies [40,41] have introduced a variety of landmark and image fusion processes. Hasani et al. [42] proposed merging videos and landmarks. The deformable synthesis model (DSM) was proposed by Fabiano et al. [43]. These algorithms demonstrate the effectiveness of landmark features; nonetheless, emotion identification algorithms employing landmark characteristics have rarely been investigated in recent years. This is not because the information offered by landmark features is insufficient, but rather because suitable techniques for extracting information from landmark features have yet to be chosen. Recently, Ngos et al. [44] introduced a graph convolutional neural network utilizing facial landmark features to identify points, and the edges of the graph were constructed by employing the Delaunay technique. Khoeun et al. [45] proposed a feature vector approach for recognizing emotions of masked faces with three key components. The authors used facial landmark identification to retrieve the characteristics of faces covered with masks, and upper facial landmark coordinates were used to identify facial expressions. Nair and Cavallaro [46] suggested a robust framework for detecting and segmenting facial landmark positions to match face meshes to facial standards. First, face regions were segmented and landmark positioning was performed. Additionally, Hemang et al. [47] compared the 3D data of facial feature coordinates to the 2D coordinates acquired from a photo or live stream using Levenberg-Marquardt optimization and a projection matrix. By employing this strategy, the authors could identify the ideal landmarks and calculate the Euler angles of the face.
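To make the use of upper facial landmark coordinates concrete, the sketch below derives simple brow-eye distance features from the widely used 68-point annotation (indices 17-26 for the eyebrows, 36-47 for the eyes). The feature choice is illustrative only and is not the exact descriptor of any cited method:

```python
# Indices follow the common 68-point annotation:
# 17-26 eyebrows, 36-47 eyes.
LEFT_BROW = range(17, 22)
RIGHT_BROW = range(22, 27)
LEFT_EYE = range(36, 42)
RIGHT_EYE = range(42, 48)

def centroid(points, idxs):
    xs = [points[i][0] for i in idxs]
    ys = [points[i][1] for i in idxs]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def upper_face_features(points):
    """points: list of 68 (x, y) tuples; returns brow-eye vertical gaps."""
    feats = []
    for brow, eye in ((LEFT_BROW, LEFT_EYE), (RIGHT_BROW, RIGHT_EYE)):
        _, by = centroid(points, brow)
        _, ey = centroid(points, eye)
        # vertical brow-eye gap: grows with surprise, shrinks with anger
        feats.append(ey - by)
    return feats
```

Features of this kind remain computable when the lower face is masked, which is the motivation for upper-face landmark approaches.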

Various methods describe emotions based on a mixture of certain facial features, including the upper and lower features of the face. Existing methods that depend solely on action units are limited by the loss of information from the lower features of the face, resulting in reduced accuracy. Table 1 provides a comparison of the available techniques.

Table 1. Comparison of existing facial emotion recognition models.

Model | Facial Features | Emotions | Datasets | Recognition in Dark
ExNet [48] | Upper and lower | 7 | FER-2013, CK+, RAF-DB | No
Shao et al. [49] | Upper and lower | 7 | CK+, FER-2013 | No
Miao et al. [50] | Upper and lower | 7 | FER-2013, CASME II, SAMM | No
Wang et al. [51] | Upper and lower | 8 | FERPlus, AffectNet, RAF-DB | No
Farzaneh et al. [52] | Upper and lower | 7 | RAF-DB, AffectNet | No
Shi et al. [53] | Upper and lower | 8 | RAF-DB, AffectNet | No
Li et al. [54] | Upper and lower | 7 | RAF-DB | No
Li et al. [55] | Upper and lower | 7 | RAF-DB, AffectNet | No
Khoeun et al. [

Current facial emotion recognition algorithms [53][54][55] that rely on standard 68-landmark detection search the whole picture to locate the facial contours and then label the face with the positions of the 68 landmarks. The links between these landmarks are then analyzed. However, these approaches rely heavily on the interaction between the lower and upper features of the face; hence, performance suffers when the lower features of the face are invisible, as roughly 40% of the information becomes unavailable. For these face-based algorithms, every pixel in the detected faces is used to learn and categorize emotions, resulting in a significant degree of computational complexity and time. Face-based approaches thus have the disadvantage of utilizing many pixels that are irrelevant to the task. Furthermore, these unnecessary pixels disturb the training process, resulting in low accuracy and high complexity.


Datasets for Facial Emotion Recognition
The original MultiPie [57], Lucey et al. [58], Lyons et al. [59], and Pantic et al. [60] datasets of facial expressions were recorded in a laboratory setting, with the individuals acting out a variety of facial expressions. This method produced clean, high-quality repositories of staged facial expressions. Faces in such pictures may look different from their unposed (or "spontaneous") counterparts. Therefore, recording emotions as they happen became popular among researchers in affective computing. Situations such as this include experiments in which participants' facial reactions to stimuli are recorded [60][61][62] or emotion-inducing activities are conducted in a laboratory [63]. These datasets often record a sequence of frames that researchers may use to study expressions' temporal and dynamic elements, including capturing multi-modal impacts such as speech, bodily signals, and others. However, the number of individuals, the range of head poses, and the settings in which these datasets were collected all contribute to a lack of variety. Therefore, it is necessary to create methods based on natural, unstaged presentations of emotion. In order to meet this need, researchers have increasingly focused on real-world datasets. Table 2 provides a summary of the evaluated databases' features across all three affect models: facial action, dimensional model, and category model. In 2017, Mollahosseini et al. [24] created a facial emotion dataset named AffectNet to develop an emotion recognition system. This dataset is one of the largest facial emotion datasets of the categorical and dimensional models of affect in the real world. After searching three of the most popular search engines with 1250 emotion-related keywords in six languages, AffectNet gathered over a million photos of people's faces online. The existence of seven distinct facial expressions and the strength of valence and arousal were manually annotated in roughly half of the obtained photos.
AffectNet is unrivalled as the biggest dataset of natural facial expressions, valence, and arousal for studies on automated facial expression identification. The pictures have an average resolution of 512 × 512 pixels. The pictures in the collection vary significantly in appearance; there are both full-color and gray-scale pictures, and they range in contrast, brightness, and background variety. Furthermore, the people in the frame are mostly frontally portrayed, although items such as sunglasses, hats, hair, and hands may obscure the face. As a result, the dataset adequately describes multiple scenarios, as it covers a wide variety of real-world situations.
In the ICML 2013 Challenges in Representation Learning [64], the Facial Expression Recognition 2013 (FER-2013) [65] database was first introduced. The database was built by matching a collection of 184 emotion-related keywords to images using the Google Image Search API, which allowed capturing the six fundamental and neutral expressions. Photos were downscaled to 48 × 48 pixels and converted to grayscale. The final collection includes 35,887 photos, most of which were taken in natural real-world scenarios. Our previous work [56] used the FER-2013 dataset because it is one of the largest publicly accessible facial expression datasets for real-world situations. However, only 547 of the photos in FER-2013 depict emotions such as disgust, and most facial landmark detectors are unable to extract landmarks at this resolution and quality due to the lack of face registration. Additionally, FER-2013 only provides the category model of emotion.
Mehendale [66] proposed a CNN-based facial emotion recognition method and changed the original dataset by recategorizing the images into the following five categories: Anger-Disgust, Fear-Surprise, Happiness, Sadness, and Neutral; the Contempt category was removed. The similarities between the Anger-Disgust and Fear-Surprise facial expressions in the top part of the face provide sufficient evidence to support the new categorization. For example, when someone feels angry or disgusted, their eyebrows naturally lower, whereas when they are scared or surprised, their eyebrows rise in unison. The removal of the Contempt category may be rationalized because (1) it is not a central emotion in communication and (2) the expressiveness associated with contempt is localized in the mouth area and is thus undetectable if the individual is wearing a face mask. The dataset is somewhat balanced as a result of this merging process.
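The recategorization described above amounts to a simple label remapping; the sketch below is illustrative (the category spellings are assumptions, not Mehendale's exact identifiers):

```python
# Merging scheme described above: Anger+Disgust and Fear+Surprise
# are combined, and Contempt is dropped entirely.
MERGE_MAP = {
    "Anger": "Anger-Disgust",
    "Disgust": "Anger-Disgust",
    "Fear": "Fear-Surprise",
    "Surprise": "Fear-Surprise",
    "Happiness": "Happiness",
    "Sadness": "Sadness",
    "Neutral": "Neutral",
}

def remap(label: str):
    # returns None for categories removed from the dataset (e.g., Contempt)
    return MERGE_MAP.get(label)
```

Because the merged pairs share upper-face appearance, this mapping keeps the classes separable even when only the region above a mask is visible.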
In this study, we used the AffectNet [24] dataset to train an emotion recognition model. Since the intended aim of this study is to determine a person's emotional state even when a mask covers their face, the second stage was to build an appropriate dataset in which a synthetic mask was attached to each individual's face. To do this, the MaskTheFace algorithm was used. In a nutshell, this method determines the angle of the face and then applies a mask selected from a database of masks. The mask's orientation is then fine-tuned by extracting six characteristics from the face [67]. The characteristics and features of existing facial emotion recognition datasets are demonstrated in Table 2.
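As a simplified stand-in for this masking step, the sketch below merely occludes all rows below an assumed nose line; it reproduces only the effect of hiding lower-face information, not MaskTheFace's template fitting:

```python
import numpy as np

def occlude_lower_face(image: np.ndarray, nose_row: int, fill: int = 0) -> np.ndarray:
    """Crude stand-in for synthetic masking: blank out rows below the nose.

    MaskTheFace instead estimates the face angle and fits an oriented mask
    template from six facial keypoints; this sketch only removes the
    lower-face information that a mask would hide from the classifier.
    """
    out = image.copy()
    out[nose_row:, :] = fill  # everything below the nose line is covered
    return out
```

In the real pipeline the occlusion boundary follows the mask contour rather than a straight row, but the information loss for the classifier is analogous.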

Proposed Method for Facial Emotion Recognition
Our aim is to enhance the quality of life of BVI people by making it easier for them to communicate socially with other human beings, both during the day and at night. Wearable smart glasses and a multipurpose system able to record pictures through a tiny camera and provide facial emotion recognition results as audio data to BVI people are the most practical means of achieving this aim. The system needs a solid CPU to quickly run deep CNN models for real-time emotion recognition. As a result, we proposed a client-server architecture wherein smart glasses and a smartphone are client devices while an AI server processes input video frames. The work proposed here presents a comprehensive, two-part deep learning framework for use in all stages of the learning process. The general design of the proposed system is shown in Figure 3. The local component uses Bluetooth to transmit data between a smartphone and the smart glasses. Meanwhile, the recorded images are transferred to the AI server, which processes them and then plays the results back to the user as an audio file. Keep in mind that the smart glasses hardware has both a built-in speaker for direct output and an earphone connector for the audio connection, allowing users to hear the voice feedback sent from their smartphones.
Sensors 2023, 23, 1080

The client-side workflow entails the following steps: first, the user establishes a Bluetooth connection between their smart glasses and a smartphone. Once this is done, the user may instruct the smart glasses to take pictures, and the smartphone then receives those pictures. Capturing still images on demand serves the smart glasses' power needs better than continuous video recording. The AI server then provides spoken output through headphones, a speaker system, or a mobile device.
Despite the recent introduction of lightweight deep CNN models, we still run facial expression recognition tasks on an AI server rather than on a wearable assistive device or smartphone CPU. The fact that the smart glasses and smartphone are used solely for taking pictures also helps them last longer on a single charge.
The AI server comprises three primary models: (1) an image enhancement model for low-contrast and low-light conditions; (2) a facial emotion recognition model; and (3) a text-to-speech model for converting text results to audio results. In addition, the AI server component has two modes of operation, day and night, which are activated at the corresponding times of day. The low-light picture-enhancing model does not operate in daylight mode. The nighttime mode operates as follows: after receiving a picture from a smartphone, the system initially processes it with a low-light image improvement model to improve the image's dark-area quality and eliminate background noise. After the picture quality has been enhanced, facial emotion recognition models are applied to recognize masked and unmasked faces, and a text-to-speech model is performed. The AI server sends back the audio results in response to the client's request.
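The day/night dispatch on the AI server can be sketched as below; the brightness threshold and function names are assumptions for illustration, not the deployed implementation:

```python
import numpy as np

BRIGHTNESS_THRESHOLD = 60  # assumed cutoff on mean gray level (0-255)

def needs_enhancement(image: np.ndarray) -> bool:
    # night mode: run the low-light enhancement model only on dark frames
    return float(image.mean()) < BRIGHTNESS_THRESHOLD

def server_pipeline(image, enhance, recognize, to_speech):
    # mirrors the three-model AI server: enhance -> recognize -> text-to-speech
    if needs_enhancement(image):
        image = enhance(image)
    emotion = recognize(image)
    return to_speech(emotion)
```

Passing the three models in as callables keeps the dispatch logic independent of the specific enhancement and recognition networks.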

Low-Contrast Image Enhancement Model
Pictures captured in low contrast are characterized by large areas of darkness, blurred details, and high noise levels compared to similarly composed images captured under standard lighting. This can happen if the cameras are not calibrated properly or if there is very little light in the scene, as at nighttime or in a low-contrast environment. Thus, the quality of such pictures is poor because they lack the information required to develop sophisticated applications such as facial emotion detection and recognition. As a result, this subfield of computer vision is one of the most beneficial in the field and has drawn the interest of many scientists because of its significance in both basic and advanced uses, such as assistive technologies, autonomous vehicles, visual surveillance, and night vision.
An ideal and successful approach would be to deploy a low-light image enhancement model to allow the facial emotion recognition model to work autonomously in dark and low-contrast environments. Recently, deep learning-based models for improving low-light images have shown impressive accuracy while successfully eliminating a wide range of noise. For this reason, we implemented a low-light image improvement model using a CNN-based two-branch exposure-fusion network [26]. In the first stage of the low-light improvement process, a two-branch illumination enhancement framework is implemented, with two different enhancing methodologies used separately to increase the potential. An information-driven preprocessing module was included to lessen the deterioration in extremely low-light backgrounds. In the second stage, the outputs of these two augmentation modules were sent to the fusion part, which was trained to combine them using an attention strategy and a refining technique. Figure 4 depicts the topology of the two-branch exposure-fusion network [23]. Lu and Zhang denote the two branches −1E and −2E because the top branch is more effective for images with an exposure level of −1E, while the second branch is more effective for images with an exposure level of −2E.
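For intuition about the −1E/−2E notation, the helper below simulates exposure shifts by scaling linear intensities by 2^EV (e.g., EV = −1 halves the exposure); it is illustrative only and is not part of the fusion network:

```python
import numpy as np

def adjust_exposure(image: np.ndarray, ev: float) -> np.ndarray:
    """Scale linear intensities by 2**ev; ev = -1 halves the exposure.

    The two branches of the fusion network are specialized for inputs
    whose effective exposure is roughly -1E and -2E; this helper only
    illustrates what those exposure levels mean photometrically.
    """
    return np.clip(image.astype(np.float32) * (2.0 ** ev), 0.0, 255.0)
```

Each branch can thus be thought of as an enhancer tuned to one band of underexposure, with the fusion module blending their outputs.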
F_en^branch independently creates the −1E branch and the main structure of the −2E branch without requiring an additional denoising procedure. The improvement module's output is depicted as follows:

I_out^branch = F_en^branch(I_in), where branch ∈ {−1E, −2E}.

The input and output pictures are denoted by I_in and I_out. Initially, four convolutional layers are applied to the input picture to extract its additional features, which are then concatenated with the input low-light images before being fed into this improvement module [23]. The −2E training branch is used to teach this component to identify the degree of picture degradation caused by factors such as natural noise in the preprocessing module. In order to explain the preprocessing module's functionality, multilayer element-wise summations were used. The feature maps from the fifth convolutional layer, which used a filter size of 3 × 3, were added to the feature maps from the preceding layers to facilitate training. In addition, no activation function was utilized after the convolution layer; the input characteristics were scaled down to [0,1] using the modified ReLU function in the last layer. To be able to recreate the intricate patterns even in a dark environment, the predicted noise range was adjusted to (−∞, +∞).
In the fusion module, the two-branch network's improved results were integrated with the attention unit and then refined in a separate unit. For the −1E improved picture, the attention unit used four convolutional layers to produce the attention map S = F_atten(I′), whereas the −2E image received the corresponding element 1 − S, with S(x, y) ∈ [0, 1]. By adjusting the weighted template, this approach seeks to continually aid the development of a self-adaptive fusion technique. The attention map gives the R, G, and B channels equal consideration. The outcome of the attention unit is computed as I_atten = S ⊙ I′_−1E + (1 − S) ⊙ I′_−2E, where ⊙ denotes element-wise multiplication. While this straightforward method produces improved pictures from both the −1E and −2E branches, there is a risk that some crucial details may be lost during the fusion process. Furthermore, noise levels may rise if a direct metric is employed. To fix this, I_atten was combined with its low-light input and delivered to the F_ref refining unit, which produces the final improved picture. In this training, smooth, VGG, and SSIM loss functions were used. Total variation loss can be used to characterize the smoothness of the predicted transfer function in addition to its structural properties when working with a smooth loss. Smooth loss is calculated using Equation (5), where ∇_{x,y} denotes the per-pixel horizontal and vertical dissimilarity.
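The attention-weighted fusion described above can be sketched in a few lines of numpy; the array names and map values here are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def attention_fuse(img_1e, img_2e, attention_map):
    """Fuse two enhanced exposures with a per-pixel attention map.

    img_1e, img_2e : (H, W, 3) float arrays, the -1E and -2E branch outputs.
    attention_map  : (H, W) float array with values in [0, 1]; the same
                     weight is applied to the R, G, and B channels.
    """
    s = attention_map[..., None]          # broadcast over the color channels
    return s * img_1e + (1.0 - s) * img_2e

# Toy example: where S = 1 the -1E branch dominates; where S = 0, the -2E branch.
bright = np.full((2, 2, 3), 0.9)
dark = np.full((2, 2, 3), 0.1)
s = np.array([[1.0, 0.0], [0.5, 0.5]])
fused = attention_fuse(bright, dark, s)
```

In practice the fused result is then concatenated with the low-light input and passed to the refining unit rather than used directly.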
VGG loss was employed to solve two distinct issues. First, according to [23], when two pixels are bound by a pixel-level distance, one pixel may take the value of any pixel within the error radius; this tolerance for probable changes in hue and color depth is what makes the constraint valuable. Second, pixel-level loss functions do not accurately represent the intended quality when the ground truth is generated by a combination of several commercially available enhancement methods. VGG loss was therefore calculated using Equation (6).
where W, H, and C represent a picture's width, height, and depth, respectively. Specifically, the mean squared error was used to evaluate the gap between these elements. In this case, SSIM loss outperforms L1 and L2 as loss functions because it simultaneously measures brightness, contrast, and structural diversity. The SSIM loss (Equation (7)) can be written as L_SSIM = 1 − SSIM(x, y), where SSIM(x, y) = (2μ_xμ_y + C_1)(2σ_xy + C_2)/((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)). These three loss functions are combined with weighting coefficients to form the total loss in Equation (8). Two popular image datasets [72,73] were used to train the low-light image improvement model. During individual −1E and −2E branch training, the coefficient CC was first set to zero and then gradually increased to 0.1 in the joint training phase; DD, in contrast, was held constant at 0.1 throughout the process. Each dataset was split into a training set and an evaluation set. The results of the low-light image enhancement model are illustrated in Figure 5. The image enhancement algorithm's dark-lighting results were fed into the facial expression recognition model.
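As a concrete illustration of the smooth (total-variation) term, here is a minimal numpy sketch; the function name and the mean-of-absolute-differences reduction are assumptions, since the text only states that ∇_{x,y} denotes per-pixel horizontal and vertical dissimilarity.

```python
import numpy as np

def smooth_loss(t):
    """Total-variation style smoothness penalty for a 2D map t.

    Averages the absolute horizontal and vertical per-pixel differences,
    so a perfectly flat map scores 0 and a noisy map scores high.
    """
    dx = np.abs(np.diff(t, axis=1))   # horizontal dissimilarity
    dy = np.abs(np.diff(t, axis=0))   # vertical dissimilarity
    return dx.mean() + dy.mean()

flat = np.ones((8, 8))
noisy = np.random.default_rng(0).random((8, 8))
```

A flat transfer function yields zero loss, while a noisy one is penalized, which is exactly the behavior the smoothness term is meant to encourage.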

Recognizing Emotions from Masked Facial Images
Most studies explore seven types of facial emotions: Happiness, Sadness, Anger, Surprise, Disgust, Neutral, and Fear, as displayed in Figure 6. Eye and eyebrow shape and pattern may help differentiate "Surprise" from the other emotions while the bottom features of the face (cheeks, nose, and lips) are absent. It is hard to describe the difference between "Anger" and "Disgust." These two expressions can be confused because the top features of the face are almost the same as the bottom features; these two emotions may be correctly identified using a large wild dataset. The lower parts of the face are significant in conveying happiness, fear, and sadness, making it difficult to identify these emotions correctly without them. The proposed approach was inspired by the work of Khoeun et al. [45]. An illustration of the facial emotion recognition approach from the bottom part of the face-masked images is shown in Figure 7.

Generating and Detecting Synthetic Masked Face
This stage involved taking the original facial emotion images from the AffectNet [24] dataset and utilizing them to build masked face images with artificial masks. To isolate the human face from an image's background, we used MediaPipe face detection [74]. The face region size in each image was extracted correspondingly. Subsequently, the bottom part of the identified face (the cheeks, nose, and lips) was masked using the MaskTheFace approach [25], as illustrated in Figure 8. This created face images with artificial masks that were as realistic as possible. In the end, the top part of the masked face picture was used for further processing. Across the AffectNet dataset, the typical picture size was 425 pixels in both dimensions.
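The split between the masked lower half and the retained upper half can be sketched as simple bounding-box arithmetic. The 0.55 split ratio and the function name below are illustrative assumptions, not values from the paper, which delegates masking to MaskTheFace.

```python
def split_face_region(x, y, w, h, split_ratio=0.55):
    """Given a detected face bounding box (x, y, w, h), return the upper
    region kept for feature extraction and the lower region that the
    synthetic mask will cover.

    split_ratio is the fraction of the face height (from the top) treated
    as the upper part -- an illustrative choice, roughly the line between
    the eyes and the nose bridge.
    """
    split = y + int(h * split_ratio)
    upper = (x, y, w, split - y)            # eyes, eyebrows, forehead
    lower = (x, split, w, y + h - split)    # cheeks, nose, lips (masked)
    return upper, lower

upper, lower = split_face_region(100, 50, 200, 240)
```

In the actual pipeline the face box would come from MediaPipe face detection rather than being hard-coded.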

Infinity Shape Creation
We set out to solve the problem of obstructed lower facial features by creating a fast facial landmark detector. We found that during emotional expression, the uncovered areas of the face (the eyes and eyebrows) widen in contrast to the obscured areas (the lips, cheeks, and nose) [45]. To further aid emotion classification, we aimed to implement a landmark identifier and select the facial detail vectors that indicate the crucial connections among those areas. We implemented this step to guarantee that the produced points covered the eyebrows and eyes. Instead of using the complete pixel region for training and classifying distinct emotions, it is necessary only to identify the significant properties that occur between the lines linking neighboring locations. Consequently, the computational complexity is drastically decreased.

Normalizing Infinity Shape
The initial collection of points used to produce the infinity shape is at a different scale than the upper part of the face image. Before placing the initial infinity shape in its final place, it must be scaled to the correct dimensions so that it adequately covers the whole upper part of the face. The original x or y coordinate value is transformed into a different range or size according to each upper part of the face, as indicated in Equations (9) and (10), which follow the standard min–max form v_new = (v − min_old)/(max_old − min_old) × (max_new − min_new) + min_new and allow us to determine the new coordinates of every position. Here, the upper-face width or height defines the new range (max_new and min_new). As a result, the x and y coordinates were normalized so that the infinity shape's size was comparable to that of the upper part of the face. Moreover, this is one of the many adaptable features of the method. Each area of the upper part of the face is measured, and then the original position set is normalized based on the average of these measurements.
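A minimal sketch of this rescaling, under the assumption that Equations (9) and (10) are the standard min–max range-remapping formulas (the function names are ours):

```python
def rescale(value, min_old, max_old, min_new, max_new):
    """Map a coordinate from [min_old, max_old] into [min_new, max_new]."""
    return (value - min_old) / (max_old - min_old) * (max_new - min_new) + min_new

def normalize_shape(points, face_width, face_height):
    """Rescale a set of (x, y) shape points so the shape spans the
    upper-face region, i.e. x in [0, face_width], y in [0, face_height]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [
        (rescale(x, min(xs), max(xs), 0, face_width),
         rescale(y, min(ys), max(ys), 0, face_height))
        for x, y in points
    ]

# Toy infinity-shape points rescaled to a 200x80 upper-face region.
shape = [(2.0, 1.0), (4.0, 3.0), (6.0, 1.0)]
scaled = normalize_shape(shape, 200, 80)
```

After this step the shape's extremes coincide with the borders of the upper-face region, so it fully covers the eyes and eyebrows.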

Landmark Detection
In order to identify masked and unmasked faces, we created a landmark detection technique. At this point, we used a model that detects landmarks on both masked and unmasked faces. Following this, MediaPipe was used to construct a deep learning framework. MediaPipe is a toolkit for developing machine learning pipelines for processing video streams. The MediaPipe face-mesh model [26] we used in our study calculates 468 3D facial landmarks, as displayed in Figure 9. Using transfer learning, researchers at Google created the machine-learning model MediaPipe face mesh [26], with the goal of recognizing the three-dimensional topology of a user's face. The network architecture of the face mesh was developed on top of the BlazeFace [75] model, whose primary function is detecting faces inside a given picture or video frame using bounding boxes. Face mesh estimates 3D coordinates after BlazeFace bounding boxes have been used to encircle the face. The pipeline consists of two deep neural networks that run in real time. The first part is a face detector that processes the entire image and calculates where faces are located. The second part is the 3D face landmark model that uses these points to construct a regression-based approximation of the 3D face surface.
To extract the eye-related features, it is first necessary to find the region of interest of both the right and left eyes. Every landmark localized on a facial muscle has a powerful connection to other landmarks in the same muscle or neighboring muscles. In this study, we observed that exterior landmarks had a detrimental impact on the accuracy of facial emotion identification. Therefore, we employed the MediaPipe face mesh model to identify landmarks for the upper features of the face, such as the eyes and eyebrows, using these landmarks as input characteristics with the intention of improving the model's performance. The set of facial landmarks is calculated as FL = {(x_{l,f}, y_{l,f}) | l = 1, …, L; f = 1, …, F}, where x_{l,f} and y_{l,f} are the locations of each facial landmark, and L and F indicate the number of facial landmarks and image frames, respectively.
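Restricting the 468 face-mesh landmarks to the upper-face regions amounts to indexing into the landmark list. The index sets below are placeholders for illustration, not the authoritative MediaPipe face-mesh index lists; in practice they would come from the face-mesh topology definitions.

```python
# Placeholder index sets -- NOT the official MediaPipe face-mesh contours.
UPPER_FACE_INDICES = {
    "left_eye": [33, 133, 159, 145],
    "right_eye": [263, 362, 386, 374],
    "left_eyebrow": [70, 63, 105],
    "right_eyebrow": [300, 293, 334],
}

def select_upper_face(landmarks):
    """Keep only the upper-face landmarks from a full list of (x, y) points.

    landmarks: sequence of 468 (x, y) tuples, one per face-mesh landmark.
    Returns a dict mapping each region name to its (x, y) points.
    """
    return {
        region: [landmarks[i] for i in idxs]
        for region, idxs in UPPER_FACE_INDICES.items()
    }

# Dummy landmark list: landmark i sits at (i, i), just to show the selection.
dummy = [(i, i) for i in range(468)]
upper = select_upper_face(dummy)
```

Excluding the exterior and lower-face indices in this way is what removes the landmarks the authors found detrimental to accuracy.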

Feature Extraction
In this step, we evaluated and selected the important features of the upper part of the face. Figure 9 demonstrates that the majority of the detected upper-face landmarks lie within the confines of the eye and eyebrow regions. The links between the identified landmarks and the landmarks' individual coordinates are important elements for the categorization procedure. There are still some outliers, so the identified locations are treated as potential landmarks. In addition to completing the feature extraction procedure, this stage aims to eliminate unimportant elements. To do this, we applied the histograms of oriented gradients (HOG) from Equations (11)–(14) to all of the potential landmarks on the upper part of the face to obtain the orientation of the landmarks and their relative magnitudes.
Each landmark's descriptor consists of its location (x and y) and 72 values representing the directional strength within a blob (20 by 20 pixels) around the facial landmark. As a result, each landmark yields a 74-element feature vector. In this way, the meaningful data collected from each landmark include details on the connections between the many points of interest in a small area. Among qualitative identifiers, the HOG's goal is to normalize the attributes of every item such that identical items always yield the same feature identifier regardless of context. The HOG uses the dominant values of the local gradient vectors. It relies on the standardization of local histograms, the image's gradient, and a set of histogram orientations for several places, and it is essential to restrict the local histograms to the region of the block. The basic assumption is that the distribution of border directions (local intensity gradients) determines local object appearance and shape, even when there are no cues to the exact border placements or the corresponding gradients. The HOG characteristics discovered for each landmark are combined into one vector. When a landmark's x and y locations are combined with the HOG feature vector, the resulting information is assumed to be representative of that landmark. Before moving on to the classification stage, each image's landmark information is labeled with its distinct emotion.
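The 74-element descriptor can be sketched with numpy. The 2×2-cell, 18-bin layout (2 × 2 × 18 = 72) is an illustrative assumption; the paper states only that 72 directional-strength values are computed over a 20 × 20 blob per landmark.

```python
import numpy as np

def landmark_descriptor(patch, x, y, cells=2, bins=18):
    """Build a 74-element descriptor for one landmark: (x, y) plus a
    72-value gradient-orientation histogram over a 20x20 patch.
    """
    gy, gx = np.gradient(patch.astype(float))          # image gradients
    mag = np.hypot(gx, gy)                             # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0       # orientation in [0, 360)

    step = patch.shape[0] // cells
    feats = []
    for i in range(cells):
        for j in range(cells):
            sl = (slice(i * step, (i + 1) * step),
                  slice(j * step, (j + 1) * step))
            # Magnitude-weighted orientation histogram for this cell,
            # normalized so identical structures match regardless of contrast.
            hist, _ = np.histogram(ang[sl], bins=bins, range=(0, 360),
                                   weights=mag[sl])
            norm = np.linalg.norm(hist)
            feats.extend(hist / norm if norm > 0 else hist)
    return np.concatenate([[x, y], feats])

desc = landmark_descriptor(np.random.default_rng(1).random((20, 20)), 37.0, 52.0)
```

Concatenating such descriptors for all detected upper-face landmarks yields the feature set passed to the classifier.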

Emotion Classification
In order to classify emotions, information from all landmarks is gathered collectively and sent to the training stage. We used CNNs, the Mish activation function, and the Softmax function for classification, as seen in Figure 10. Due to the many locations and characteristics in each facial image, all of the data are taken into account within a single training frame. The network architecture of a CNN design optimized for recognizing different facial emotions is shown in Figure 10. CNNs rely on their central convolution layer, which best represents local connectivity and weight-sharing qualities. Each convolution layer was generated using the input picture and many trainable convolution filters, together with the batch normalization approach, Mish activation function, and max pooling parameters, all of which were also used in the feature extraction of the emotion recognition model. The batch normalization method was employed to decrease the time spent on learning by normalizing the inputs to a layer and stabilizing the learning process of the algorithms. Without an activation function, the CNN model has the characteristics of a simple linear regression model. Therefore, the Mish activation function was employed in the network to learn complicated patterns in the image data.
Furthermore, to reach the substantially more nuanced view afforded by deep description, an activation function is needed to build nonlinear connections between inputs and outputs. Even though leaky ReLU is widely used in deep learning, Mish often outperforms it. The use of Mish is a significant development that has the potential to enhance categorization precision.
The Mish activation function (Equation (15)) is defined as Mish(x) = x · tanh(ln(1 + e^x)), and the leaky ReLU activation function is calculated as f(x) = max(αx, x), where α is a small positive slope. The maximum value in each area of the facial feature map was then determined using a max-pooling process. We reduced the feature map's dimensionality during the pooling process by moving a two-dimensional filter over each feature map. To avoid the need for exact feature positioning, the pooling layer summarized the features contained in an area of the feature map created by the convolution layer. By reducing the number of dimensions, the model becomes less sensitive to shifts in the locations of the elements in the input data. The final layer of the proposed CNN model utilized a Softmax classifier, which can predict seven emotional states, as illustrated in Figure 10.
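The two activations can be compared side by side in numpy; the α = 0.01 slope for leaky ReLU is the common convention and an assumption here, since the paper does not state its value.

```python
import numpy as np

def mish(x):
    """Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic near zero."""
    return x * np.tanh(np.log1p(np.exp(x)))

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU passes positives unchanged and scales negatives by alpha."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])
m = mish(x)
l = leaky_relu(x)
```

Unlike leaky ReLU, Mish is smooth everywhere, which is often credited for its slight edge in classification accuracy.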

Experimental Results and Analysis
This part details the methodology used to test the facial emotion recognition model and the outcomes obtained. Facial emotion recognition datasets were utilized for both training and testing purposes. Significant hyperparameters for the training stage include a learning rate of 0.001, a batch size of 32, and a subdivision of 8. Investigating the classifier's performance is essential for developing a system that can consistently and correctly categorize facial emotions from masked facial images. This research examines and analyzes the performance of the proposed model and nine other facial emotion recognition models for the purpose of accuracy comparison. In the evaluation, the proposed model accurately recognized more facial emotions than the other models, indicating that it successfully recognizes emotion in the wild. Both qualitative and quantitative analyses were employed to determine the experimental results.
To conduct this study, we utilized the AffectNet [24] database, one of the biggest and most widely used natural facial expression datasets for in-the-wild facial emotion recognition, consisting of 420,299 manually annotated facial images. Following the lead of the vast majority of previous studies [52,75-80], we chose to train our models using seven emotions and leave contempt out of scope (more than 281,000 training images). We conducted our analyses using the validation set, which consisted of 500 images per emotion and 3500 images in total. We randomly selected pictures for the test set from the remaining dataset. It is worth noting that the deep learning model was trained on the original images, scaled to 512 by 512 without cropping the faces, so many surrounding elements, such as hair, hats, or hands, are present. We mainly did this to ensure that the images are wild and similar to real-life scenarios. During the training process, we employed the stochastic gradient descent optimizer.
Images from this widely used database were chosen due to their wide variety of subjects close to real-world illumination, head orientation, interferences, gender, age range, and nationality. The AffectNet database, for example, includes people aged 2 to 65 from a range of countries. Approximately 60% of the photos are of women, giving a rich diversity of perspectives and a challenging task. It allowed us to investigate real-world photographs of people's faces, which are rich with anomalies. The images were collected from online web sources, and two annotators each marked a set of images with simple and complex keywords. In this study, we only utilized the standard emotion images for the seven fundamental emotions: Happiness, Sadness, Anger, Surprise, Disgust, Neutral, and Fear. All of the photos were 256 pixels per side after being aligned. Facial hair, age, gender, nationality, head position, eyeglasses and other forms of occlusion, and the effects of illumination and other post-processing techniques, including different image filters, all contribute to a great deal of visual diversity throughout the available pictures. The participants in this dataset were Asian, European, African American, and Caucasian, spanning the age groups 2-5, 6-18, 19-35, 36-60, and 60+.

Instead of relying on an embedded approach, which may not be the best choice for maintaining the power storage viability of smart glasses and securing the system's real-time execution, it is preferable to use a high-performance AI server [22]. Whether or not the designed smart glasses system is practical depends on how well the AI server executes, because smart glasses designs often use deep learning models that need significant processing power on the AI server. All experiments were run on an AI server, and the specifications of the server are summarized in Table 3. A client component, including smart glasses and a smartphone, sent acquired photos to the AI server.
After that, computer vision and deep learning models were used to process the input photos. The outcomes were transmitted to the client component via Wi-Fi/Internet so the user may listen to the output sounds over headphones/speakers. The following are the qualitative and quantitative experimental outcomes of the deep learning models implemented on the AI server.

Qualitative Evaluation
Initially, we assessed the proposed facial emotion recognition model qualitatively. We used a computer's web camera for a live video stream for this experiment. Figure 11 shows the classification results of the facial emotion recognition model from a masked face in a real-time environment. In the upper left corner of the image, the percentage of classification along with the recognized facial emotion is shown, and on the right side of the mask, the class of the facial emotion is illustrated. This demonstrates that the proposed facial emotion recognition model successfully labelled seven facial expressions. As a programmable module, it may be included in smart glasses [31] to help BVI people to perceive people's emotions in social communication.

Quantitative Evaluation Using AffectNet Dataset
This work evaluated the classification approaches for facial emotion recognition using quantitative evaluation procedures. Quantitative experiments were carried out, and the results were analyzed using widespread object detection and classification evaluation metrics such as accuracy, precision, sensitivity, recall, specificity, and F-score, as in previous works [81-84]. Precision measures how well a classifier can separate relevant data from irrelevant data, i.e., the proportion of correct identifications, while recall measures the proportion of relevant instances the model identifies among all ground truths. The proposed method's findings were compared against ground-truth images at the pixel level, and precision, recall, and F-score metrics were calculated. Accuracy (AC), precision (PR), sensitivity (SE), specificity (SP), recall (RE), and F-score (FS) for the facial emotion recognition systems were determined as AC = (TP + TN)/(TP + TN + FP + FN), PR = TP/(TP + FP), SE = RE = TP/(TP + FN), and SP = TN/(TN + FP), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively, and C is the number of categories. We also calculated the F-score, which measures how well precision and recall are balanced and is also named the F1 score or F measure; it is defined as FS = 2 · PR · RE/(PR + RE), where a larger value indicates better performance. Table 4 shows the comparison results of the proposed and other models. As indicated in Table 4, the proposed model obtained the highest result, with 69.3% accuracy for the images derived from the AffectNet dataset. Figure 9 also compares the precision, recall, F-score, and accuracy of the proposed model's application to the AffectNet dataset. The performance in Table 4 confirms that the proposed model ranks first on the AffectNet dataset, with an accuracy of 69.3%. This was followed by the models of Farzaneh et al. [52] and Shi et al. [53], which obtained an accuracy of 65.2% (a difference of 4.1% from the proposed model).
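Under the standard definitions above, the per-class metrics follow directly from the four confusion counts; this is a minimal sketch with illustrative counts, not numbers from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall (= sensitivity), specificity,
    and F-score from the four confusion counts of one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return {"AC": accuracy, "PR": precision, "RE": recall,
            "SP": specificity, "FS": f_score}

# Illustrative counts for one emotion class.
m = classification_metrics(tp=80, tn=540, fp=20, fn=60)
```

For a multi-class evaluation these values would be computed per class and then averaged over the C categories.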
According to our findings, identifying people's emotional expressions when a mask covers the lower part of the face is more problematic than when the whole face is available. As expected, we discovered that the accuracy of expression detection on a masked face was lower across all seven emotions (Happiness, Sadness, Anger, Fear, Disgust, Neutral, and Surprise). Table 5 and Figure 12 illustrate the results of the suggested emotion recognition method across the seven emotion categories. Facial landmarks such as the eyebrows and eyes move and change more in Happiness and Surprise; therefore, the results for these emotions are the most accurate, with 90.3% and 80.7% accuracy, respectively. However, Fear, Sadness, and Anger produce only minor adjustments to the eyebrow and eye landmarks, obtaining 45.2%, 54.6%, and 62.8% accuracy, respectively. Although the landmark features associated with the emotions of Fear and Disgust (the eyebrows and eyes) are nearly identical, an accuracy of 72.5% was reached.

Evaluation Based on Confusion Matrix
Furthermore, as seen in Figure 13, the suggested model was assessed by employing a confusion matrix for facial emotion recognition. For each of the seven categories, the authors chose 100 facial emotion photos randomly. Roughly 80% of randomly chosen photos depict a single subject on a simple background, while the remaining 20% depict wild scenes with complex backgrounds. The assessment findings indicate that the proposed approach has an accuracy of 69.3%, and the average result of the confusion matrix is 71.4%.
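The confusion-matrix evaluation can be reproduced with a few lines of plain Python; the toy label sequences below are illustrative, not the paper's test data.

```python
EMOTIONS = ["Happiness", "Sadness", "Anger", "Surprise", "Disgust", "Neutral", "Fear"]

def confusion_matrix(true_labels, pred_labels, classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        m[idx[t]][idx[p]] += 1
    return m

def accuracy(matrix):
    """Overall accuracy: the diagonal sum over the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# One Happiness image misclassified as Surprise; the rest correct.
true = ["Happiness", "Happiness", "Fear", "Anger"]
pred = ["Happiness", "Surprise", "Fear", "Anger"]
cm = confusion_matrix(true, pred, EMOTIONS)
```

The off-diagonal entries make it easy to see which emotion pairs (e.g., Anger vs. Disgust) the classifier confuses most.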

Conclusions
In this study, we employed deep CNN models and facial landmark identification to recognize an individual's emotional state from a face image where the bottom part of the face is covered with a mask. AffectNet, a dataset of images labeled with seven basic emotions, was used to train the suggested face emotion identification algorithm. During the studies, the suggested system's qualitative and quantitative performances were compared to other widespread emotion recognizers based on facial expressions in the wild. The results of the experiments and evaluations showed that the suggested method was effective, with an accuracy of 69.3% and an average confusion matrix of 71.4% for the AffectNet dataset. Assistive technology for the visually impaired can greatly benefit from the suggested facial expression recognition method.
Despite the accuracy mentioned above, the work has some limitations in various orientation scenarios: facial landmark features were not correctly obtained due to head-orientation issues. Furthermore, the proposed model also failed to recognize emotions when multiple faces were present in the same image at an equal distance from the camera.
The authors plan to further refine the classification model and image datasets by investigating methods such as semi-supervised and self-supervised learning. As the attention CNN relies on robust face identification and facial landmark localization modules, we will investigate how to produce attention parts in faces without landmarks. In addition, we plan to work on the hardware side of the smart glasses to create a device prototype that can help the visually impaired identify people, places, and things in their near surroundings.