A Unified Framework of Deep Learning-Based Facial Expression Recognition System for Diversified Applications

Abstract: This work proposes a facial expression recognition system for a diversified field of applications. The purpose of the proposed system is to predict the type of expression in a human face region. The implementation of the proposed method is divided into three components. In the first component, a tree-structured part model is applied to the input image to predict landmark points, from which the facial region is detected. The detected face region is normalized to a fixed size and then down-sampled to several smaller sizes so that the advantages of multi-resolution images can be exploited. In the second component, several convolutional neural network (CNN) architectures are proposed to analyze the texture patterns in the facial regions. In the third component, advanced techniques such as data augmentation, progressive image resizing, transfer learning, and fine-tuning of the parameters are employed to enhance the performance of the proposed CNN models and to extract more distinctive and discriminant features. The results obtained from the different CNN models are fused to achieve better performance than the existing state-of-the-art methods. Extensive experimentation has been carried out using the Karolinska Directed Emotional Faces (KDEF), GENKI-4k, Cohn-Kanade (CK+), and Static Facial Expressions in the Wild (SFEW) benchmark databases. A comparison with existing methods on these databases shows that the proposed facial expression recognition system outperforms the competing methods.


Introduction
Facial expressions [1] are a crucial non-verbal means of conveying meaning and represent a unique, universal way for people to communicate. The facial expression recognition system (FERS) is a contactless recognition system in which the image of a person's face can be captured from a distance without any intervention or interruption, even when the person is moving around, walking, sitting, or performing activities [2]. Facial expressions play an essential role in our daily communication with people and in social interactions [3]. The FERS is mainly used to identify types of human facial expression [4]. According to Ekman et al. [5], there are six basic expressions, with a neutral face serving as a baseline reference. Figure 1 shows some examples of human facial expressions, e.g., fear (FA), anger (AN), disgust (DI), happy (HA), neutral (NE), sad (SA), and surprise (SU). The FERS [6] is an emerging research topic in computer vision. It has broad potential in a diversified field of applications [7], each with its own challenges, in healthcare, education, marketing research, business organization, customer and retail fields, government, entertainment, and the Internet of Things (IoT). Several vendors, such as Microsoft, IBM, Amazon, and Google, provide application programming interfaces based on facial expressions, but with limited solutions. Here, these diversified fields of application for FERS are discussed in terms of four major concerns, which are as follows:

[Figure 1: example face images labeled Fear, Anger, Disgust, Happy, Neutral, Sad, and Surprise.]
• e-Healthcare:-The real-time FERS can be incorporated into the healthcare system to remotely analyze and detect patients' feelings from images by identifying facial expressions [8], for patients of different ages, puberty levels, and genders, with data collected from a giant cloud and from social networks. m-Health provides mobile device-based practices to patients to support their medicine administration and daily healthcare facilities. e-Health is an electronic health service that uses information and communication technology for delivering facilities digitally and for handling patients and doctors through computers for drug administration. Both e-Health and m-Health provide immense support to healthcare industries in building e-Healthcare systems to ensure that patients, doctors, medical professionals, and businesses benefit, as well as to ensure the establishment of a healthy civilization with technological advancements in smart cities. Electronic healthcare systems provide services to physically localize and monitor patients by recognizing their voice, speech, gesture movement, and facial expressions. Our proposed facial expression recognition system (FERS) will improve the services of healthcare systems. It is a significant challenge to obtain good results in the context of more efficient and less costly health services. Hence, while integrating the FERS into the healthcare framework, all healthcare requirements, such as automated intelligent sensors, sophisticated tools, security, authenticity, access, and privacy, should also be considered.

• Social IoT:-Social IoT systems represent an evolution of IoT-based systems. They establish a platform for interconnecting subjects or objects worldwide through social relationships and provide better services to users by relaxing the common interests between the users. The services of social IoT are now exploited in emotion recognition, as emotions relate to the social activities of humans in their daily life. Hence, the integration of social IoT services will make life easier with several social care facilities for people [9]. The proposed FERS is useful for developing IoT-based smart devices and appliances. It can be used in several sectors such as education, marketing research, retail, government, media and content, gaming, and finance. During online teaching, the facial expressions of students can be related to their interest in and understanding of the topics that have been taught to them. Sentiment derived from online trading and investment strategies will be beneficial for the financial development of an organization. Emotion analysis using customer reviews and shopkeepers' experience will support the marketing research of an organization.

• Emotion AI:-Emotion AI [10] has wide applications in human resource management in any business organization. It helps the human resource management system (HRMS) during the recruitment and selection of candidates. Emotion AI considers several traits, such as voice and text, to analyze the sentiments of candidates.

• Cognitive AI:-Cognitive AI [11] provides methods and technologies to build a decision-making system based on the behavior and reasoning ability of a person. It helps a person to make decisions through a system. Job searching, salary prediction, career path selection for job seekers, AI-enabled cyber-security, and natural language processing for sentiment analysis fall under the cognitive AI category. Thus, social interaction, planning, interpretation, decision-making, emotional competence, and self-learning capabilities are the processes of cognitive AI.
These applications use facial images for recognizing expressions in humans. The psychology of facial expression [12] states that the face is the key to understanding emotions. Linking the face to emotions may be an important idea in the psychology of emotions. The facial expression recognition system works on facial movements [13], which are described by the facial action coding system (FACS). The FACS breaks down facial expressions into action units, each of which introduces a distinct change in the facial appearance. In psychological research, the FACS has various uses in discovering neuropsychiatric disorders and in studying social-emotional development. The FACS is an immediate, powerful, and effective non-verbal communication tool to transmit messages and convey emotional information. In most implementations of FERS, the facial region is analyzed as a texture, and numerous techniques such as statistical and structural-based methods have been employed to extract discriminant features [14]. Apart from these techniques, deep learning-based approaches with convolutional neural networks [15] have recently been employed to extract more discriminant and distinctive features so that better performance can be obtained. However, most of these methods are database-dependent, and these databases have been captured spontaneously under controlled environments [16] with tightly controlled illumination, age, and pose variation conditions.
Despite the significant progress of the current state-of-the-art methods for the FERS in affective computing, they still suffer from some limitations: (i) The employed datasets are either laboratory-controlled or wild. The wild images are captured in unconstrained environments and suffer from several challenging issues such as illumination, poor resolution, occlusion, pose, age, and expression variations. Thus, the extraction of the face region from the input images in optimal time is also a challenging issue. (ii) Due to limited domain knowledge, the local-to-global feature representation schemes generate less discriminative and distinctive patterns. (iii) The assumptions behind feature selection might not hold, i.e., the extracted local geometric information or the geometric features of action units may not be valid. Hence, we have proposed a novel deep learning-based framework for the FERS to address these problems and improve its usefulness in diversified fields of applications such as e-Healthcare, social IoT, and emotion AI. The contributions of the proposed work are as follows:

• We have designed a fast and efficient end-to-end deep learning-based framework using the convolutional neural network approach for learning face representation, adding some extra levels of feature representation schemes to improve the robustness and generalization of the model.

• The obtained predictive model detects and learns powerful high-level features from the input image and extracts more distinctive and discriminant features that provide effective results for the proposed FERS under various illumination changes as well as pose and age variation artifacts.

• To enhance the performance of the FERS, several experiments have been carried out on the trade-off between batch size and number of epochs, and on data augmentation, progressive image resizing, hyper-parameter tuning, and transfer learning techniques, to better predict the expression type on the human face and to improve the performance as well as the robustness of the proposed system.

• The proposed method addresses the challenging issues of FERS. At the same time, a series of experiments has been conducted to reduce the training loss and the over-fitting problems that arise due to inadequate training data and bias in the variation of expressions.

The organization of this paper is as follows: Section 2 describes the related work for the proposed system. The proposed facial expression recognition system (FERS) is discussed in Section 3, which describes the face pre-processing techniques and the proposed CNN architectures for the feature computation of both frontal and profile facial images. The database description, experimental results, and discussions are described in Section 4. Finally, Section 5 concludes this paper.

Related Work
Automatic facial expression recognition and classification for multi-pose and multi-level face images has proven to be an attractive and challenging problem over the last thirty years [17]. A review of the literature shows that early research focused on several statistical and structural-based methods [14]. In addition, template-based and feature-based approaches [17] have also been investigated. Classical methods, such as the Histogram of Oriented Gradients (HOG) [18], the Scale Invariant Feature Transform (SIFT) [19], Local Binary Pattern (LBP) [20] features, and some spatio-temporal features (STM-ExpLet [21]), have been adopted by many researchers to obtain texture features in statistical ways; however, these methods require great effort to achieve high performance. Recently, researchers have used convolutional neural networks (CNN, ConvNets) [22] and have achieved great success in large-scale static image and video sequence recognition [23]. The CNN has been widely applied in FER systems and has significantly improved on state-of-the-art practices, as it did in the ImageNet classification challenges [22]. Earlier CNN models were used to solve character recognition tasks [24], but nowadays, CNNs are widely used in various object recognition problems. Here, the most important ingredient for the success of CNNs is the availability of large quantities of training data, which can be increased through image augmentation techniques [15]. Additionally, the CNN achieves high performance by learning powerful high-level features that combine global appearance with local geometric features, rather than relying on conventional handcrafted features. However, the training image samples suffer from intensity noise, illumination, pose and expression variation, motion blur, low resolution, and occlusion by hair artifacts.
CNN-based FER targets applications such as people-sentiment analysis, multimodal human-machine or human-computer interaction, and intelligent systems, together with the challenges that arise when images are captured in an unconstrained imaging environment.
Based on the existing state-of-the-art methods for face representation, facial expression recognition can be broadly classified into two categories: appearance-based methods and facial action unit-based methods. In the appearance-based methods, the entire face region is divided into several blocks or patches, and features are extracted from these patches using the Local Binary Pattern (LBP) [25], Histogram of Oriented Gradients (HOG) [26], and Scale Invariant Feature Transform (SIFT) [19], as explored by Zhao et al. [27] and others. Facial action unit-based methods usually exploit face geometrical information or an action unit-driven representation for facial expression classification. Tian et al. [28] used the positions of facial landmarks for facial action unit recognition and then performed expression classification. The appearance-based method [29] is the most successful and well studied for face recognition. In several works, the whole face image captured under controlled lab conditions was taken as the input image $I_{m \times n}$, and a subspace was created by removing inconsistent and redundant dimensions of the face space with dimensionality reduction techniques [30]; for instance, Fisher LDA, PCA, and LPP [31] have been adopted. A comparative literature review of these methods for facial expression recognition is given in [32]. LDA and PCA are, in practice, based on kernel methods. The Euclidean structure and miscellaneous learning methods [33] have been employed for face recognition [34]. These techniques are computationally expensive, and some of these systems may fail due to the system explicitly exhibiting the exact structure of the manifold. However, there are also powerful tools based on statistical signal modeling, known as sparse coding. Sparse coding provides good results for facial expression recognition [35] systems.
In contrast to these handcrafted features, deep learning methods are regarded as a breakthrough in computer vision and have set new performance records on recognition tasks.
Many state-of-the-art methods and deep learning frameworks use hand-labeled points and CNN architectures both for feature extraction and for building facial expression recognition systems. Gutta et al. [36] proposed a model with an ensemble of radial basis functions, a grayscale image, and inductive decision trees for the four-class (i.e., Asian, Caucasian, African, and Oriental) ethnicity recognition problem. Zhang and Wang [37] proposed a method for two-class racial classification using multi-scale LBP (Local Binary Pattern) texture features while combining 2D and 3D texture features. Zhang et al. [38] described two types of features for the FER system, namely geometry-based features and Gabor wavelet-based features. Bartlett et al. [39] applied Gabor filters coupled with feature selection and machine learning techniques for recognizing facial expressions on a human face. In [40], Rose applied Gabor and log-Gabor filters to low-resolution images for facial expression recognition. Wu et al. [41] explored Gabor motion energy filters [42] to recognize the dynamic facial expressions of individuals. Gabor filters together with genetic algorithms (GA) and SVM were employed in [25] for the analysis of six basic facial expressions from video sequences. In [43], Gu et al. proposed a method for facial expression recognition based on the radial encoding of local Gabor features with classifier synthesis. Almaev et al. [44] proposed a new dynamic feature descriptor called Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) by combining LBP-TOP [45] and Gabor filters.
The major problems that occur during the development of a FER system concern shallow features and bias caused by differing cultures and collection conditions. Current datasets have a strong built-in bias, and the corresponding proposed methods show that the conditional probability distributions of the training and testing datasets are different. We will assess this bias and present novel deep CNN models to address these issues. In our proposed methodology, we considered face recognition as an image classification problem. This definition of face recognition has been extended in our work to the classification of facial expressions on human faces. The proposed FER system is based on two backbones: (1) face preprocessing and (2) the design and analysis of features from the proposed CNN architectures. The proposed CNN architecture is built using several convolutional, max-pooling, batch normalization, and dropout layers with an optimizer, followed by a soft-max classifier for the final classification task. Our extensive experimental results show that our proposed deep CNN method achieves superior results for facial expression recognition problems on both lab-controlled and real-world databases. The principal issues involved in the design of a facial expression recognition system are face representation and classifier selection [31]. Face representation concerns extracting feature descriptors from the input face image that minimize the intra-class variations and maximize the inter-class dissimilarities. In the case of classifier selection, it cannot be assumed that a high-performance classifier will always find a better separation between different classes, especially when there are significant similarities between them. Sometimes, even the most sophisticated classifier may fail at facial expression recognition and classification tasks due to inadequate face representations.
Conversely, we cannot achieve high recognition accuracy if we employ a good face representation but do not select a good classifier. Hence, the sections below describe the proposed FER system.

Proposed Methodology
In this work, we have proposed a facial expression recognition system (FERS) for diversified fields of applications, such as e-Healthcare, social IoT, emotion AI, and cognitive AI. The block diagram of the proposed method is shown in Figure 2. Since these fields belong to interdisciplinary research areas, the algorithms and techniques employed during the implementation of these frameworks are interconnected. Thus, the proposed FERS will be used as the common platform for analyzing expressions in the applications of these frameworks. The implementation of a basic FERS is discussed in the following paragraphs. A facial expression recognition system generally consists of face representation, feature extraction, and classifier components. Given the importance of face recognition in computer vision research, we have proposed robust, efficient, and accurate deep convolutional neural network (CNN) models for facial expression recognition systems. Here, an image $I_{m \times n}$ with a valid face region is used as the input to the system. Our objective is to predict the type of expression, such as fear, anger, disgust, surprise, sadness, happiness, or neutral, from the input face region $I_{m \times n}$. The proposed facial expression recognition system (FERS) has been implemented in four steps: (i) pre-processing, wherein the bounding-box face region is detected from the input image using the tree-structured part model [38]; (ii) feature extraction, wherein the global generic CNN features are extracted from the detected bounding-box face region and are prepared for the next level through the deep learning model; (iii) refinement, wherein the representations are further modified by using multi-stage progressive image resizing followed by transfer learning methods, with image augmentation and fine-tuning of parameters also adopted; and (iv) classification, which concerns predicting the expression class of the facial region.
Each of these steps appears in the block diagram of the proposed system presented in Figure 2. Recently, deep convolutional neural network (CNN) techniques have been successfully developed to learn discriminative features in various fields, and they are widely used in deep FER representation. Deep FER suffers from the over-fitting problem due to the lack of sufficient training samples, age variations, head poses, identity bias, and illumination variations. The proposed method focuses on these issues and overcomes the computational complexity of the proposed system.

Face Preprocessing
In this work, we have implemented a deep learning framework to recognize discrete human facial expression categories. Each input face image is resized and normalized to a fixed-size face image $I_{m \times n \times 3}$. Here, the input images are aligned so that facial parts such as the eyes and the tip of the nose are mapped to the same locations; this mapping is known as a feature map. At the lowest level of abstraction, preprocessing is a standard term for operations on intensity images, where the input and output intensity images are of the same kind as the original data captured by the sensor. An intensity image is usually represented by a matrix of image function values. The goal of preprocessing is to enhance the expression of the region of interest and to suppress the unwanted, redundant, and inconsistent noise in the image. Image preprocessing methods are classified into four categories according to the size of the pixel neighborhood that is used for the calculation of the new pixel brightness: pixel brightness transformations; geometric transformations; preprocessing methods that use a local neighborhood of the processed pixel; and image restoration, which requires knowledge about the entire image. Here, the required face region is detected from the input image $I_{m \times n}$ using a tree-structured part model [46], and the detected face is resized to a fixed-size image $F_{n \times n}$. These face images are used as input to the proposed CNN models. Since facial expressions contain very minute details, it is important to analyze both the expressive and the non-expressive characteristics of the facial region. The tree-structured part model works better for both frontal and profile face regions than Haar-like features [47], and it performs very well compared to other face detection algorithms in computer vision.
The tree-structured part model is based on a mixture of trees, with a global mixture capturing the topological changes due to viewpoint. For an unconstrained image with an unknown face region, this model locates all the facial landmarks in $I_{m \times n}$. For facial landmark localization, we consider $L_q^p = (x_q^p, y_q^p)$ as the coordinate of the pixel location of part $q$. The tree-structured part model computes thirty-nine landmark points for profile faces and sixty-eight landmark points for frontal faces. From these landmark points, the four corner points of the face region $F_{n \times n}$ are computed. The face preprocessing steps are shown in Figure 3.
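As an illustration of the last step, the following is a minimal NumPy sketch of how the corner points of a face region could be derived from a set of landmark points and how the cropped region could be brought to a fixed $n \times n$ size. It does not implement the tree-structured part model itself: the landmarks are assumed to be supplied by the detector as an array of $(x, y)$ coordinates. The function names, the `margin` parameter, and the nearest-neighbour resizing are assumptions made for this example and are not taken from the original method.

```python
import numpy as np


def face_box_from_landmarks(landmarks, image_shape, margin=0.0):
    """Return (x0, y0, x1, y1) enclosing all landmarks, clipped to the image.

    `margin` (hypothetical parameter) enlarges the box by a fraction of its size.
    """
    pts = np.asarray(landmarks, dtype=float)  # shape (num_points, 2): (x, y)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    height, width = image_shape[:2]
    x0 = int(max(0, np.floor(x_min - pad_x)))
    y0 = int(max(0, np.floor(y_min - pad_y)))
    x1 = int(min(width, np.ceil(x_max + pad_x)))
    y1 = int(min(height, np.ceil(y_max + pad_y)))
    return x0, y0, x1, y1


def crop_and_resize(image, box, n):
    """Crop the box and resize it to n x n by nearest-neighbour sampling.

    Nearest-neighbour sampling is a stand-in; any interpolation could be used.
    """
    x0, y0, x1, y1 = box
    face = image[y0:y1, x0:x1]
    rows = np.arange(n) * face.shape[0] // n  # source row for each output row
    cols = np.arange(n) * face.shape[1] // n  # source column for each output column
    return face[rows][:, cols]
```

Calling `crop_and_resize` with several values of `n` on the same box would give the down-sampled, multi-resolution copies of the face region mentioned in the abstract.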

Feature Representation for Expression Classification
Feature extraction is the crucial task of extracting discriminating features from the input image $F_{n \times n}$ so that the extracted features contain more distinctive patterns [48].
Here, the input image may be a grayscale or an RGB color image. In computer vision and image processing, feature extraction starts from an initial set of measured data and builds features that are intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps. In many cases, texture feature extraction techniques [49] lead to better human interpretations. Moreover, feature extraction is related to dimensionality reduction, i.e., when the input image is too large to be processed directly, it is transformed into a reduced set of features, also called a feature vector. Modern generic CNN feature representations for facial expression recognition can compete with the statistical and structural-based methods of computer vision. These generic CNN features can cope with articulated and occluded face images captured in an unconstrained environment and can achieve better performance.
The proposed method describes a complex CNN baseline model with two components, i.e., a feature extraction part and a classification part. The proposed CNN model uses a convolutional neural network architecture with five to seven deep image perturbation layers. For feature extraction, the model performs convolution operations with the ReLU activation function, followed by max-pooling and batch normalization operations. Finally, the feature maps extracted from the top layers are flattened and passed to two fully connected layers for the classification task. The performance of the proposed CNN has been increased by adding new levels through image augmentation and progressive image resizing methods. These also help the model to avoid the over-fitting and imbalanced data problems. Progressive image resizing also helps the model to avoid excessive computational cost. Since only the pre-trained weights of the last few layers are used, these weights have to be learned properly. We take advantage of image augmentation, batch normalization, the activation function, and regularization methods, including mix-up and label smoothing techniques. The convolution operation is the primary operator and main building block of a CNN architecture. Convolution is a mathematical operation that combines two functions and generates a third function. Here, it is used to extract features from the images. In a CNN model, the convolution operation is executed over the input image with the help of a $t \times t$ kernel or filter and generates feature maps. The operation is performed by sliding the filter over the input, followed by a non-linearity. At every location, an element-wise multiplication between the filter and the underlying image patch is performed and the results are summed into the corresponding entry of the feature map. Finally, we used fine-tuning of the parameters and fusion methods to enhance the performance of the proposed recognition system.
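The sliding-window computation described above can be sketched in plain NumPy as follows. This is an illustrative single-channel sketch under stated assumptions, not the implementation used in the proposed system: the function names are hypothetical, zero padding is assumed, and the output size uses the standard relation $\lfloor (n - t + 2P)/S \rfloor + 1$ for an input of size $n$, kernel size $t$, padding $P$, and stride $S$. The kernel is flipped so that the code matches the mathematical definition of convolution; most deep learning libraries skip the flip and compute cross-correlation instead.

```python
import numpy as np


def conv_output_size(n, t, padding=0, stride=1):
    """Spatial size of the feature map for input size n and kernel size t."""
    return (n - t + 2 * padding) // stride + 1


def conv2d(image, kernel, padding=0, stride=1):
    """Slide a t x t kernel over a 2-D image; sum element-wise products."""
    t = kernel.shape[0]
    padded = np.pad(image, padding)          # zero padding on every side
    out_h = conv_output_size(image.shape[0], t, padding, stride)
    out_w = conv_output_size(image.shape[1], t, padding, stride)
    flipped = kernel[::-1, ::-1]             # true convolution flips the kernel
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            fmap[i, j] = np.sum(padded[r:r + t, c:c + t] * flipped)
    return fmap


def relu(x):
    """Element-wise non-linearity max(0, x)."""
    return np.maximum(0, x)


def max_pool(fmap, size=2):
    """Non-overlapping max-pooling; trailing rows/columns that do not fill a window are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))
```

A feature-extraction stage in this sketch would be `max_pool(relu(conv2d(image, kernel)))`; a real CNN applies many such kernels per layer over multi-channel inputs and learns the kernel weights during training.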
The layers employed in the proposed CNN architecture are described as follows:

• Convolution:-The convolution layer is the core building block of a CNN model and performs most of the computation. Convolution is a linear operation involving a set of kernels or filters $W_{t \times t}$. A kernel is a small matrix of weights that slides over the input [50] and performs element-wise multiplication. The convolution operation essentially computes dot products between a set of learnable filters $W_{t \times t}$ and local regions of the input image $F_{n \times n}$, and produces an output matrix of dimension $n' \times n'$. Here, $n'$ is calculated as $n' = \frac{n - t + 2P}{S} + 1$, where $S$ is the stride, which governs how many cells the filter moves to the right and down, from the top-left corner to the bottom-right corner of the input image, to calculate the next cell in the result. Additionally, $P$ is the padding added around the input, which counteracts the shrinking of the height and width of the volumes. Mathematically, the convolution operation is formulated as follows [51]: for an input feature vector $F = f(v)$ and a filter vector $W = w(v)$, the convolution operation is obtained as $(F * W)(v) = \sum_{u} f(u)\, w(v - u) = \langle f(u), w(v - u) \rangle$, where the operator $*$ denotes the convolution operation and $\langle \cdot, \cdot \rangle$ represents the sliding inner product between the input feature $f(u)$ and the flipped kernel $w(v - u)$. It measures the similarity between the two vectors. The primary benefits of the convolution operation are: (i) parameter or weight sharing, as a feature detector that is useful in one part of the image can be reused in other parts; (ii) a reduced number of effective parameters and robustness to image translation; and (iii) sparsity of connections, i.e., each output value depends only on a small number of inputs.

• Max-pooling:-A pooling operation aggregates the values in each local window of a feature map (e.g., by taking their maximum or average) to reduce the size of its input, typically by half in each spatial dimension. The advantages of pooling operations include removing noise, correcting images, and overcoming incidental occlusions [52]. The pooling layer is used to reduce the size of the representation, which speeds up computation and makes some of the detected features more robust. There are different types of pooling operations, such as average pooling, fractional max-pooling, and max-pooling. Max-pooling is the pooling operation used in most CNN models. It calculates the maximum value of each patch of a feature map and uses it to create a down-sampled feature map. It is usually applied after a convolutional layer. The primary benefits of max-pooling are as follows: (i) translation invariance, i.e., translating the image by a small amount does not significantly affect the values of most pooled outputs; (ii) reduced computational cost; (iii) faster matching; and (iv) improved accuracy.

• Fully Connected Layers:-Fully connected layers and convolutional layers are often described as distinct, but fully connected layers can be regarded as a special case of convolutional layers [53]. In our proposed CNN model, we used two fully connected layers denoted as $FC_1$ and $FC_2$. Here, the $n_2$ neurons in $FC_2$ have full connections to all $n_1$ activations in $FC_1$. The activations can be computed with a matrix multiplication followed by a bias offset. Let $x \in \mathbb{R}^{n_1 \times 1}$ represent the output vector of layer $FC_1$ and let $W \in \mathbb{R}^{n_1 \times n_2}$ denote the weight matrix of $FC_2$. Suppose $w_i$, the $i$-th column of $W$, is the weight vector of the $i$-th neuron in layer $FC_2$ [54]. Then, the output of $FC_2$ is obtained as $W^T \times x$. The output of the fully connected layers is independent of the input image size. The fully connected layers of a CNN architecture reduce the full image to a single vector of class scores, with one score per expression class.

• Dense Layers:-The dense layer is a type of fully connected layer in deep neural networks [55]. In a dense layer, every input is connected to every output by a weight. It performs linear operations on its input parameters $X$ and generates output parameters [56] that are in turn connected to the next layer as inputs. It utilizes dense connections between layers with matching feature map size, $X_l = g(W^T X_{l-1})$, where $g$ is the activation function, e.g., ReLU, defined as $p(x) = \max(0, x)$.

• Batch Normalization:-Batch normalization is used to normalize the outputs of the previous layer for each batch, maintaining the values in a comparable range with a mean of 0 and a standard deviation of 1. This helps the CNN model to prevent skew at any one particular point and increases the computation speed. We applied batch normalization after every convolution layer and then passed the normalized values to the ReLU activation function. Batch normalization acts as a regularizer and allows the model to use higher learning rates [57]. It is used in various image classification problems and achieves higher accuracy with fewer training steps. Batch normalization also has a beneficial effect on the gradient flow through the network by reducing the dependence of the gradients on the scale of the parameters or of their initial values. It also regularizes the model and reduces the need for dropout layers. Batch normalization is calculated as follows. For a mini-batch $\chi$ of size $m$ with activation values $x^{(l)}$ (omitting $l$ for clarity), the mini-batch is expressed as $\chi = \{x_1, \ldots, x_m\}$. Following the standard formulation [57], the mini-batch mean and variance, the normalized values, and the scale and shift step are

$$\mu_\chi = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_\chi^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\chi)^2, \qquad \hat{x}_i = \frac{x_i - \mu_\chi}{\sqrt{\sigma_\chi^2 + \epsilon}}, \qquad y_i \leftarrow \theta \hat{x}_i + \psi \equiv \mathrm{BN}_{\theta,\psi}(x_i),$$

where $\epsilon$ is a small constant added for numerical stability, and $\theta$ and $\psi$ are the learnable scale and shift parameters.

• Regularization:-Regularization strategies are designed to reduce the test error of a machine learning algorithm, possibly at the expense of the training error [58]. Popular regularization methods in deep learning [59] include dropout, R1-regularization, and discriminative regularization, among others. We employed dropout on the penultimate layer $\alpha = [a_1, a_2, \ldots, a_F]$ (where $F$ is the number of filters) of our proposed deep CNN model, with a constraint on the $\ell_2$-norms of the weight vectors [60]. Dropout drops a unit during training with a specified probability. Without it, a neural network can become too reliant on particular connections; dropout prevents this co-adaptation of the hidden units by randomly setting a portion of them to zero during forward and backward propagation. Instead of using $\gamma = \omega \cdot \alpha + \delta$ for the output unit $\gamma$ in forward propagation, dropout uses $\gamma = \omega \cdot (\alpha \otimes \beta) + \delta$, where the operator $\otimes$ performs element-wise multiplication and $\beta \in \mathbb{R}^{F}$ is a masking vector of Bernoulli random variables. At test time, all units are present and the learned weight vectors are scaled by the probability $P$ of retaining a unit such that $\hat{\omega} = P \times \omega$, where $\hat{\omega}$ is used to compute the class scores without dropout. The advantage of using dropout is that it prevents artificial neural networks from over-fitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks: for each training sample, a randomly selected subset of units, together with their incoming and outgoing connections, is temporarily removed from the network. With a dropout probability of 0.5, for example, roughly half of the activations in each layer are removed for every training sample, which prevents hidden units from relying on the presence of other hidden units.

• Optimization:-The parameters of our CNN models were learned with stochastic optimization methods. Popular optimization methods used for solving FERS problems include Adagrad, SGD, RMSProp, SGD with momentum, AggMo, Demon, Demon CM, DFA, and Adadelta, all of which operate on stochastic mini-batches. In this study, we used the popular Adam optimizer, a first-order gradient-based method for stochastic objective functions. Adam estimates adaptive learning rates from lower-order moments of the gradient. Adam [61] uses only the first two moments of the gradient, $m_t$ and $v_t$, and the learning rate or step size $\eta$. Following [61], for the gradient $g_t$ at step $t$ and the decay rates $\rho_1, \rho_2 \in [0, 1)$, the weight updates for the Adam optimizer are calculated as
$$m_t = \rho_1 m_{t-1} + (1 - \rho_1) g_t, \qquad v_t = \rho_2 v_{t-1} + (1 - \rho_2) g_t^{2},$$
$$\hat{m}_t = \frac{m_t}{1 - \rho_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \rho_2^{t}}, \qquad \omega_{t+1} = \omega_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where $\epsilon$ is a small number. The primary advantages of the Adam optimizer are that it works well in practice and is suitable for problems with large training data sets. Like RMSProp, Adam can handle non-stationary objective functions, while it overcomes the difficulty with sparse gradients that appears in RMSProp. Adam compares favorably with other stochastic optimizers. Its implementation is straightforward and computationally efficient, with low memory requirements.
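The Adam update described above can be illustrated with a minimal pure-Python sketch. It follows the standard formulation of [61]; the function names, the default hyper-parameters (`lr`, `rho1`, `rho2`, `eps`, where the decay rates are often written β1 and β2), and the toy quadratic objective are assumptions made for illustration and are not taken from the proposed system.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    """Apply one Adam update to a list of scalar weights.

    w, grad, m, v are lists of equal length; t is the 1-based step counter.
    Returns the updated (w, m, v).
    """
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, grad, m, v):
        m_i = rho1 * m_i + (1.0 - rho1) * g_i          # first-moment estimate
        v_i = rho2 * v_i + (1.0 - rho2) * g_i * g_i    # second-moment estimate
        m_hat = m_i / (1.0 - rho1 ** t)                # bias correction
        v_hat = v_i / (1.0 - rho2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


def minimise_quadratic(steps, lr=0.05):
    """Toy objective (an assumption): f(w) = (w - 3)^2, starting from w = 0."""
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * (w[0] - 3.0)]
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w[0]
```

On the first step the bias-corrected ratio has magnitude close to 1, so the weight moves by roughly `lr` regardless of the gradient scale, which is one reason the step size is easy to tune.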
The proposed CNN architectures are built from the blocks discussed in the previous section. Here, an input image $F_{n_H \times n_W}$ is convolved with a set of kernels of size $t \times t$. The outputs of these convolution layers are called feature maps. The feature maps are stacked to provide multiple filters on the input. We used filters of size 3 × 3 with a stride of 1 for each convolution layer. The activation function for each convolution layer was ReLU. The computational complexity of the CNN models was reduced by using $d \times d$ pooling layers, which reduce the output size from one layer to the next in the hidden network layers. We used 2 × 2 max-pooling operations, which select the maximum elements and thereby preserve the important features [62]. Hence, each of these layers halves the spatial size of its input. To feed the pooled output from the stacked feature maps to the final layers, the maps were flattened into one column. The final layers of the CNN were two fully connected layers with $M$ nodes each. The fully connected layers also used the ReLU activation function, and both were regularized using dropout layers. Finally, a Softmax layer was employed after the two fully connected layers, with the number of nodes equal to the number of expression classes.
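As a small illustration of the 2 × 2 max-pooling and flattening steps described above, the following pure-Python sketch pools a toy 4 × 4 feature map. The function names and the toy values are hypothetical; a real model would use the pooling layer of its deep learning framework.

```python
def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 on a 2-D list with even height and width."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows, 2):
        pooled_row = []
        for c in range(0, cols, 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            pooled_row.append(max(window))  # keep the strongest response
        pooled.append(pooled_row)
    return pooled


def flatten(feature_map):
    """Flatten a 2-D feature map into a single list (row-major order)."""
    return [value for row in feature_map for value in row]


# Toy 4x4 feature map (illustrative values only).
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 4]]
pooled = max_pool_2x2(fmap)   # [[4, 5], [6, 7]]
flat = flatten(pooled)        # [4, 5, 6, 7]
```

Each spatial dimension is halved (4 × 4 becomes 2 × 2), which is the size reduction the text attributes to the pooling layers.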
In deep learning-based feature representation of images, it has been observed that CNN models obtain better representations when patterns are analyzed at multiple image resolutions. Additionally, adding layers to the architecture as the image resolution increases allows hidden patterns in the feature maps to be analyzed more deeply. Inspired by these observations, we applied multi-resolution facial images with varying numbers of layers in different CNN architectures. During feature representation, we considered facial images at three different resolutions such that the original facial image $F_{n \times n}$ was down-sampled to $F_{n_1 \times n_1}$, $F_{n_2 \times n_2}$, and $F_{n_3 \times n_3}$, where $n_3 = 2 \times n_2 = 4 \times n_1$. For the facial images $F_{n_1 \times n_1}$, $F_{n_2 \times n_2}$, and $F_{n_3 \times n_3}$, three different CNN architectures, namely $\mathrm{CNN}_1$, $\mathrm{CNN}_2$, and $\mathrm{CNN}_3$, were proposed. These architectures are shown in Figures 4-6, and their detailed descriptions, including the hidden layers employed, the output shapes of the convolved images, the input image sizes, and the parameters generated at each layer, are given in Tables 1-3, respectively.
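The multi-resolution down-sampling can be sketched as follows. The relation $n_3 = 2 \times n_2 = 4 \times n_1$ comes from the text, and $n_1 = 48$ with a 200 × 200 normalized face is taken from the experimental section of this paper; the nearest-neighbour interpolation and the function names are assumptions, since the interpolation method is not stated.

```python
def resize_nearest(image, size):
    """Resize a square 2-D list to size x size by nearest-neighbour sampling."""
    n = len(image)
    return [[image[(r * n) // size][(c * n) // size] for c in range(size)]
            for r in range(size)]


def multi_resolution(face, n1=48):
    """Return the three resolutions n1, 2*n1 and 4*n1 keyed by side length."""
    sizes = (n1, 2 * n1, 4 * n1)
    return {size: resize_nearest(face, size) for size in sizes}


# Synthetic 200x200 gray-scale "face" (placeholder data, not a real image).
face = [[(r + c) % 256 for c in range(200)] for r in range(200)]
pyramid = multi_resolution(face)
```

Here `pyramid[48]`, `pyramid[96]`, and `pyramid[192]` would be the inputs to the three architectures, one per resolution.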

[Figure: CNN architecture diagram. Recoverable layer labels: Block-1; Flatten; Dense + Batch Normalization + Activation + Dropout (twice); Facial Expression Class.]
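The "Dense + Batch Normalization + Activation + Dropout" stage that appears in the architecture diagrams can be sketched in pure Python as below. This is a simplified illustration under stated assumptions: batch statistics are used at both training and test time (a framework would keep running averages), dropout follows the test-time scaling by the retention probability described earlier, and all names and toy values are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Compute W^T x + b, with weights of shape n1 x n2 and x of length n1."""
    return [sum(weights[i][j] * x[i] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def batch_norm(values, theta=1.0, psi=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale (theta) and shift (psi)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((x - mean) ** 2 for x in values) / m
    return [theta * (x - mean) / math.sqrt(var + eps) + psi for x in values]


def relu(values):
    return [max(0.0, v) for v in values]


def dropout(activations, keep_prob, rng, training):
    """Training: zero each unit with probability 1 - keep_prob.
    Test: keep all units and scale by keep_prob, which is equivalent to
    scaling the next layer's weights by keep_prob."""
    if not training:
        return [keep_prob * a for a in activations]
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def block_forward(batch, weights, bias, keep_prob, rng, training=True):
    """Dense -> batch normalisation -> ReLU -> dropout for a mini-batch."""
    z = [dense(x, weights, bias) for x in batch]            # m rows, n2 columns
    columns = [batch_norm(list(col)) for col in zip(*z)]    # normalise per feature
    normalised = [list(row) for row in zip(*columns)]
    return [dropout(relu(row), keep_prob, rng, training) for row in normalised]


# Toy mini-batch of four 2-D inputs and a 2 x 3 weight matrix (illustrative only).
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
b = [0.0, 0.0, 1.0]
train_out = block_forward(batch, W, b, 0.5, random.Random(0), training=True)
test_out = block_forward(batch, W, b, 0.5, random.Random(0), training=False)
```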

Factors Affecting the Performance of the Proposed FERS
• Data Augmentation:-Data augmentation is used to expand the set of training samples in order to improve recognition performance and the ability of the models to generalize. In machine learning, image augmentation techniques artificially increase the amount of training data by applying transformations to the existing data [63]. The classical augmentation techniques employed here, taken from [15], are bilateral filtering, unsharp filtering, horizontal flip, vertical flip, Gaussian blur, additive Gaussian noise, image scaling, image cropping, translation, image rotation, shear mapping, image zooming, image filling, and contrast normalization. All training images were flipped horizontally as a simple form of image data augmentation. In this work, we applied these techniques at each image resolution.
• Fine Tuning:-Fine-tuning adapts the higher-order feature representations in the base model to make them more relevant for the face recognition tasks. For example, VGG uses many layers and generates a high-dimensional feature vector, and thus the inference is quite costly at run-time because of the huge number of parameters. In this case, fine-tuning was applied by freezing some layers, and hence their parameters, and retraining the model to reduce the computational overhead.
• Progressive Resizing:-Progressive image resizing is a well-known technique that sequentially resizes all images while training the CNN models, moving from smaller to larger image sizes. A CNN is first trained with images of size n × n and its weights are saved; the CNN is then retrained for further iterations with images of sizes greater than n. This technique has been used for super-resolution [64], where low-resolution images are gradually increased to a higher resolution during training.
The advantages of using progressive resizing are that it improves generalization and reduces overfitting problems.
• Transfer Learning:-The principal concept behind transfer learning for facial expression recognition and classification problems is that a model trained on large data sets for one problem can be reused as a generic model for other related problems. The model that has been trained earlier is known as the pre-trained model. Our proposed deep convolutional neural network model uses a transfer learning technique in which the weights of the pre-trained model $\mathrm{CNN}_1$, and/or a set of its layers, are used for the new model $\mathrm{CNN}_2$ to solve similar problems. Similarly, the weights of $\mathrm{CNN}_2$ have been adopted for the $\mathrm{CNN}_3$ model. The benefits of using transfer learning are that it reduces the training time and can result in lower generalization errors.
• Scores Fusion:-In the proposed system, three CNN architectures have been proposed. These architectures take images of different sizes as inputs. Thus, during the recognition of facial expressions on a test sample $F$, there are three different classification score vectors, namely $s_1 = (a^{1}_{1}, a^{1}_{2}, \cdots, a^{1}_{7})$, $s_2 = (a^{2}_{1}, a^{2}_{2}, \cdots, a^{2}_{7})$, and $s_3 = (a^{3}_{1}, a^{3}_{2}, \cdots, a^{3}_{7})$, where each $a^{i}_{j}$ is the classification score given by the $\mathrm{CNN}_i$ architecture for the $j$-th expression class. These classification scores are fused together using score-level post-classification fusion approaches [14] to increase the performance of the recognition system. In this work, two score-level fusion techniques, namely Sum-rule and Product-rule, were employed; they are defined in Equations (1) and (2), respectively.
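A minimal sketch of score-level fusion is given below. It assumes the plain, unweighted forms of the two rules, in which the fused score for class $j$ is $a^{1}_{j} + a^{2}_{j} + a^{3}_{j}$ for the Sum-rule and $a^{1}_{j} \times a^{2}_{j} \times a^{3}_{j}$ for the Product-rule; the exact definitions used in the proposed system may differ (for example, by weighting or normalization). The score values and the class order are hypothetical.

```python
def sum_rule(score_vectors):
    """Fuse score vectors by adding the per-class scores."""
    return [sum(scores) for scores in zip(*score_vectors)]


def product_rule(score_vectors):
    """Fuse score vectors by multiplying the per-class scores."""
    fused = []
    for scores in zip(*score_vectors):
        product = 1.0
        for s in scores:
            product *= s
        fused.append(product)
    return fused


def predict(fused_scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Assumed class order (illustrative): FA, AN, DI, HA, NE, SA, SU.
CLASSES = ["FA", "AN", "DI", "HA", "NE", "SA", "SU"]
s1 = [0.10, 0.05, 0.05, 0.40, 0.30, 0.05, 0.05]   # hypothetical CNN_1 scores
s2 = [0.05, 0.05, 0.05, 0.30, 0.45, 0.05, 0.05]   # hypothetical CNN_2 scores
s3 = [0.05, 0.05, 0.05, 0.50, 0.25, 0.05, 0.05]   # hypothetical CNN_3 scores

fused_sum = sum_rule([s1, s2, s3])
fused_product = product_rule([s1, s2, s3])
label_sum = CLASSES[predict(fused_sum)]
label_product = CLASSES[predict(fused_product)]
```

In this toy case the second network alone would predict NE, but both fusion rules predict HA because the other two networks agree on it.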

Experimentation
In this section, the experimentation of the proposed FER system is discussed. For this purpose, experiments were carried out on four challenging benchmark facial expression databases. Each database was randomly divided so that 50% of the samples were used for training and the remaining 50% for testing. This partitioning was repeated ten times and the average performance was reported for each database. As no benchmark datasets have been built specifically for healthcare, social IoT, emotion AI, or cognitive AI scenarios, the employed datasets were taken as backbones for the e-Healthcare, social IoT, emotion AI, and cognitive AI applications discussed in this paper. The proposed system has not been tested in a real-time scenario; still, the employed datasets are very challenging, and the proposed method is designed to handle the unconstrained situations of facial expression recognition that arise in real-time e-Healthcare, social IoT, emotion AI, and cognitive AI applications.
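The evaluation protocol described above (a random 50/50 split, repeated ten times, with the average performance reported) can be sketched as follows. The `evaluate` callback stands in for training and testing a CNN and is hypothetical; here it is replaced by a trivial majority-class classifier on synthetic labels so that the sketch is self-contained.

```python
import random


def random_half_split(n_samples, rng):
    """Shuffle sample indices and split them into two halves."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


def average_accuracy(n_samples, evaluate, repeats=10, seed=0):
    """Repeat the random split and average the accuracy returned by evaluate."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        train_idx, test_idx = random_half_split(n_samples, rng)
        accuracies.append(evaluate(train_idx, test_idx))
    return sum(accuracies) / len(accuracies)


# Synthetic labels (placeholder data): 70 samples of class 0, 30 of class 1.
labels = [0] * 70 + [1] * 30


def majority_evaluate(train_idx, test_idx):
    """Stand-in for 'train a model, then test it': predict the majority class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


mean_acc = average_accuracy(len(labels), majority_evaluate)
```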

Database Used
The first database employed was Karolinska-directed emotional faces (KDEF) [42], which contains seventy different subjects (thirty-five male and thirty-five female) with five different pose variations, labeled with seven basic expression categories. Here, we used 1210 samples as the training set and 1213 samples as the testing set, as only these samples were available from the site from which the database was downloaded under the license agreement. Figure 7 shows some examples from this database. The second database employed was the GENKI database [65], which is composed of 4000 facial images labeled with two classes: (i) happy and (ii) non-happy. For this database, two thousand images were randomly selected as the training set, while the remaining two thousand images formed the testing set. Some examples from this database are shown in Figure 8.
The Extended Cohn-Kanade (CK+) database [66] is our third database, which is composed of 593 video sequences from 123 subjects aged 18 to 50 years. Here, only 309 image sequences were labeled with six basic expressions. During the experimentation, we randomly split this database into the training and testing sets. Figure 9 presents some image samples from the CK+ database. Our fourth database was Static Facial Expressions in the Wild (SFEW) [26], which was created from the AFEW video database by selecting keyframes based on facial point clustering. The challenging SFEW dataset contains 700 images, which were divided into a training set (346 images) and a testing set (354 images). This database has seven facial expression classes, namely afraid, anger, disgust, happiness, neutral, sadness, and surprise. Figure 10 presents image samples from the SFEW database. Table 4 presents the detailed descriptions of the KDEF, GENKI, CK+, and SFEW facial expression databases. These databases were selected because (i) their expressions are common and generic in people; (ii) the other expressions used in affective computing research are composed of a mixture of these basic facial expressions; and (iii) a good recognition system for these expressions would be very beneficial for several real-world applications, such as e-Healthcare frameworks, the social Internet of Things (IoT), and emotion AI in business organizations.

Results and Discussion
This section describes and explains the experimentation of the proposed facial expression recognition system (FERS). The proposed FERS was implemented using Python 3.7.9, TensorFlow 2.3.1, Keras 2.4.3, CUDA 11.2, and NVIDIA-SMI driver version 460.79 on Windows 10 Pro 64-bit, with an Intel(R) Core(TM) i7-9700 CPU (3.30 GHz, 8 CPUs), 16 GB RAM, and an 8 GB NVIDIA GeForce RTX 2070 SUPER GPU (XLA GPU device). During experimentation, we employed both gray-scale and RGB-colored images, as some databases have RGB images while others only have gray-scale images. During image preprocessing, we detected the face region in each input image $I$ by applying the methods discussed in Section 3.1. The detected face region $F$ was then normalized to a fixed-size image $F \in \mathbb{R}^{200 \times 200}$. To recognize the expression classes on the human face, we employed deep learning-based approaches in which three convolutional neural network (CNN) architectures (Figures 4-6) were designed. These CNN architectures were trained to perform both feature computation and expression classification. To better understand the functionality of these architectures, we started the experiments with the $\mathrm{CNN}_1$ architecture (Figure 4), whose input is an image $F \in \mathbb{R}^{n_1 \times n_1}$ with $n_1 = 48$; i.e., the training samples $F_{48 \times 48}$ were used to train the $\mathrm{CNN}_1$ architecture, while the performance of the trained $\mathrm{CNN}_1$ model was evaluated using the remaining testing samples. Learning the parameters of any CNN architecture is a very important task and depends on two factors, i.e., epochs and batches. Both factors affect the learning capability of the architecture during training. Thus, a trade-off between epochs and batches was established, which improved the performance of the FERS using the $\mathrm{CNN}_1$ architecture (Figure 4).
The performance under the trade-off between epochs and batches is shown in Figure 11. From this figure, it is observed that the performance gradually improved with increasing epochs (with the best performance at nearly 700 epochs) while the batch was kept fixed at 16.
Inspired by the experiment shown in Figure 11, another experiment was conducted using the $\mathrm{CNN}_1$ architecture with the batch fixed at 16 and varying epochs for the KDEF, CK+, and SFEW databases; the performance is shown in Figure 12. From this figure, we can observe that the performance for each database was good between 700 and 800 epochs.
Figure 11. Effect on the performance of the proposed FERS due to the trade-off between epochs and batches using the $\mathrm{CNN}_1$ architecture for the CK+ database.

[Plot: accuracy (%) versus epoch (500 to 800).]
Figure 12. Effect on the performance of the proposed FERS while keeping the batch fixed with varying epochs using the $\mathrm{CNN}_1$ architecture for the KDEF, CK+, and SFEW databases.

Effect of Data Augmentation Techniques
The data augmentation techniques were applied to the training samples to increase their number. With the enlarged training set, the parameters of the $\mathrm{CNN}_1$ architecture were learned well and a better performance was obtained. Moreover, data augmentation plays an important role in adapting to the diversity of the training data and in avoiding overfitting. In this work, each training image was flipped horizontally and then vertically. Then, affine transformations such as rotation, scaling, zooming, and shearing were performed. For data augmentation, we employed the methods described in [15], which derive seventeen images from each sample. Figure 13 shows the effect of the data augmentation techniques on the performance of the FERS using the $\mathrm{CNN}_1$ architecture; from this figure, we can observe that the performance of the proposed FERS increases when these techniques are employed.
Figure 13. Effect on the performance of the proposed FERS using proposed data augmentation techniques.
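The flipping step of the augmentation can be sketched in pure Python as below. Only the horizontal and vertical flips are shown; the affine transformations and the remaining methods of [15] are omitted, and the function names and toy image are hypothetical.

```python
def horizontal_flip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def vertical_flip(image):
    """Mirror a 2-D image (list of rows) top to bottom."""
    return [list(row) for row in reversed(image)]


def flip_augment(image):
    """Return the original image together with its two flipped versions."""
    return [image, horizontal_flip(image), vertical_flip(image)]


# Toy 2x2 "image" (illustrative values only).
sample = [[1, 2],
          [3, 4]]
augmented = flip_augment(sample)
```

Applied to every training image, this step alone triples the number of training samples before any affine transformation is added.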

Effect of Progressive Image Resizing
The progressive image resizing technique was discussed in Section 3.3. During experimentation, the preprocessed face region $F \in \mathbb{R}^{200 \times 200}$ was further down-sampled into images of size $n_1 \times n_1 \times 3$, $2n_1 \times 2n_1 \times 3$, and $4n_1 \times 4n_1 \times 3$. With $n_1 = 48$, as in the above experiments, $F \in \mathbb{R}^{200 \times 200}$ was down-sampled to $F \in \mathbb{R}^{48 \times 48}$, $F \in \mathbb{R}^{96 \times 96}$, and $F \in \mathbb{R}^{192 \times 192}$. The $\mathrm{CNN}_1$ architecture (Figure 4) was trained with the $F_{48 \times 48 \times 3}$ images. Then, the $F_{96 \times 96 \times 3}$ images were used to train the $\mathrm{CNN}_2$ architecture (Figure 5). Lastly, the $\mathrm{CNN}_3$ architecture was trained with the $F_{192 \times 192 \times 3}$ images. The purposes of training these architectures with increasing image sizes are that (i) high-resolution images are used to train the network; (ii) the effect of multi-resolution approaches is introduced in the network such that the texture patterns at the higher level of abstraction are reflected during the learning of parameters; and (iii) the system provides deeper information that is beneficial for the hierarchical representation of features. Hence, the use of progressive image resizing not only increases the performance of the recognition system but also reduces overfitting. The effect of progressive image resizing on the performance of the proposed system is reported in Table 5. From this table, we can observe that for the KDEF, GENKI, CK+, and SFEW databases, the proposed FERS performs better for the $F_{192 \times 192 \times 3}$ images than for both the $F_{96 \times 96 \times 3}$ and $F_{48 \times 48 \times 3}$ images. Moreover, it is evident that progressive image resizing and data augmentation together are very effective for the proposed CNN models in recognizing facial expressions on facial regions.
Table 5. Effect of the progressive image resizing on the performance of the proposed FERS, where the first, second, and third rows for each database show the accuracy in percentage using the $\mathrm{CNN}_1$ ($F \in \mathbb{R}^{48 \times 48}$), $\mathrm{CNN}_2$ ($F \in \mathbb{R}^{96 \times 96}$), and $\mathrm{CNN}_3$ ($F \in \mathbb{R}^{192 \times 192}$) models, respectively.
Figure 14 presents the effect of transfer learning approaches on the performance of the proposed FERS. Here, only the performance of the $\mathrm{CNN}_3$ model trained with $F_{192 \times 192 \times 3}$ images is shown, using both progressive image resizing and data augmentation techniques.
Figure 14. Effect of transfer learning on the performance of the proposed FERS.
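One way to picture the weight transfer between models trained at different resolutions is sketched below. It is a schematic under strong assumptions and does not reproduce the layer layout of the proposed architectures: each model is a dictionary of hypothetical layers, the convolutions are assumed to preserve spatial size, and weights are copied only where the layer name and shape match. Convolution kernels have the same shape at every input resolution, whereas the first dense layer after flattening does not.

```python
def make_model(input_size, fill, filters=2, units=2):
    """Build a toy model as {layer_name: (shape, flat_weight_list)}.

    Hypothetical layout: two 3x3 convolutions, each followed by 2x2
    max-pooling, then one dense layer on the flattened feature maps.
    """
    pooled = input_size // 4                 # two 2x2 poolings halve the size twice
    flat = pooled * pooled * filters
    shapes = {
        "conv1": (3, 3, 3, filters),
        "conv2": (3, 3, filters, filters),
        "dense1": (flat, units),
    }
    model = {}
    for name, shape in shapes.items():
        size = 1
        for dim in shape:
            size *= dim
        model[name] = (shape, [fill] * size)
    return model


def transfer_matching_weights(source, target):
    """Copy weights from source to target where name and shape agree."""
    copied = []
    for name, (shape, values) in source.items():
        if name in target and target[name][0] == shape:
            target[name] = (shape, list(values))
            copied.append(name)
    return copied


cnn_48 = make_model(48, fill=1.0)   # stands in for a model already trained at 48x48
cnn_96 = make_model(96, fill=0.0)   # fresh model for 96x96 inputs
copied_layers = transfer_matching_weights(cnn_48, cnn_96)
```

In this sketch the convolutional layers are reused and the dense layer is learned again at the new resolution, which is one common way of combining progressive resizing with transfer learning.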

Effect of Score Fusion
The score fusion techniques defined in Equations (1) and (2) were applied to the classification scores obtained by the $\mathrm{CNN}_1$, $\mathrm{CNN}_2$, and $\mathrm{CNN}_3$ architectures, and the results are reported in Table 6. To better understand the performance of the proposed FERS, the confusion matrices for the KDEF, GENKI, CK+, and SFEW facial expression databases are shown in Figure 15. Here, each confusion matrix represents the product-rule-based fusion performance of the proposed FERS.
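For readers less familiar with confusion matrices, the following pure-Python sketch shows how one can be built from true and predicted class indices and how the overall accuracy is read from its diagonal. The labels are made-up toy data with three classes, not results from the proposed system.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(cm)
```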

Comparison
To compare the performance of the proposed methodology, we computed features with the competing methods and obtained their performance under the same training-testing protocol. For the KDEF database, the performance of Vgg16 [67], ResNet50 [68], Zavare et al. [42], Inception-v3 [69], and Rao et al. [70] was compared with the performance of the proposed system, as presented in Table 7. For the GENKI database, the performance of the proposed system was compared with that of Vgg16, ResNet50, Inception-v3, An et al. [29], Zhang et al. [71], and Gao et al. [72], as presented in Table 8. Similarly, we compared the performance of the proposed system with that of Sun et al. [73], ResNet50, and Inception-v3 for the CK+ database in Table 9. For the SFEW database, the performance of Liu et al. [74], Vgg16, ResNet50, and Inception-v3 was compared in Table 10. The comparison of performance, as presented in Tables 7-10, shows the superiority of the proposed system.

| Method | Accuracy (%) | Remarks |
| … | … | … (7), train/test split |
| Zavare et al. [42] | 72.55 | Images used (980), expression class (7), images type (frontal), 10-fold cross validation |
| Inception-v3 [69] | 75.04 | Images used (980), expression class (7), train/test split |
| Rao et al. [70] | 74.05 | Images used (720), expression class (6), images type (frontal), 10-fold cross validation |

Table 9. Performance comparison of the proposed FERS for the CK+ database.

Conclusions
A novel method for facial expression recognition has been proposed in this work. The objective of the proposed system is to predict the seven basic types of expressions on the human face. The applications of the proposed system have been described for the diversified fields of e-Healthcare, social IoT, emotion AI, and cognitive AI. The implementation of the proposed system has three components. In the first component, an image preprocessing task is performed in which the face region is extracted from a body silhouette image using facial landmark points. In the second component, multi-resolution images are derived from the extracted face region, and a convolutional neural network architecture is proposed for each image resolution. The images are passed through the CNN architectures and classified into seven basic facial expression classes based on the learned parameters of the CNN models. To enhance the performance of the recognition system and better handle the challenging issues of facial expression recognition, some advanced techniques, such as image augmentation, progressive image resizing, transfer learning, and fine-tuning of parameters, are employed in the third component. Finally, fusion methods are applied to the best-performing CNN models to achieve a better performance than the existing state-of-the-art methods. Extensive experimentation has been performed using four benchmark databases, namely KDEF, GENKI-4k, CK+, and SFEW, and the performance of the proposed system has been compared with some existing methods for each database. The comparison of the performance of the proposed method with the competing methods shows the superiority of the proposed system.
Institutional Review Board Statement: Ethics committee or institutional review board approval is not required for this manuscript. This research respects all the sentiments, dignity, and intrinsic values of animals or humans.

Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets employed in this manuscript were obtained under license agreements from the corresponding institutions through the proper channels.

Conflicts of Interest: The authors declare no conflict of interest.