Facial Emotion Recognition Using Transfer Learning in the Deep CNN

Human facial emotion recognition (FER) has attracted the attention of the research community for its promising applications. Mapping different facial expressions to their respective emotional states is the main task in FER. Classical FER consists of two major steps: feature extraction and emotion recognition. Currently, deep neural networks, especially the convolutional neural network (CNN), are widely used in FER by virtue of their inherent feature extraction mechanism from images. Several works have been reported on CNNs with only a few layers to resolve FER problems. However, standard shallow CNNs with straightforward learning schemes have limited feature extraction capability to capture emotion information from high-resolution images. A notable drawback of most existing methods is that they consider only frontal images (i.e., ignore profile views for convenience), although profile views taken from different angles are important for a practical FER system. To develop a highly accurate FER system, this study proposes very deep CNN (DCNN) modeling through the transfer learning (TL) technique, where a pre-trained DCNN model is adopted by replacing its dense upper layer(s) with layers compatible with FER, and the model is fine-tuned with facial emotion data. A novel pipeline strategy is introduced, where the training of the dense layer(s) is followed by tuning each of the pre-trained DCNN blocks successively; this has led to gradual improvement of FER accuracy to a higher level. The proposed FER system is verified on eight different pre-trained DCNN models (VGG-16, VGG-19, ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3, and DenseNet-161) and the well-known KDEF and JAFFE facial image datasets. FER is very challenging even for frontal views alone, and FER on the KDEF dataset poses further challenges due to the diversity of images with different profile views together with frontal views.
The proposed method achieved remarkable accuracy on both datasets with the pre-trained models. Under 10-fold cross-validation, the best FER accuracies achieved with DenseNet-161 on the test sets of KDEF and JAFFE are 96.51% and 99.52%, respectively. The evaluation results reveal the superiority of the proposed FER system over the existing ones regarding emotion detection accuracy. Moreover, the achieved performance on the KDEF dataset with profile views is promising, as it clearly demonstrates the proficiency required for real-life applications.


Introduction
Emotions are fundamental features of humans that play important roles in social communication [1,2]. Humans express emotion in different ways, such as facial expression [3,4], speech [5], and body language [6]. Among the elements related to emotion recognition, facial expression analysis is the most popular and well-researched area. Ekman and Friesen [7] identified six basic emotional expressions (anger, disgust, fear, happiness, sadness, and surprise) that are universal across cultures. In the proposed approach, a pre-trained DCNN model is adopted by replacing its upper dense layer(s) with layers compatible with FER. Next, with facial emotion data, the model is fine-tuned using the pipeline strategy, where the dense layers are tuned first, followed by tuning of each DCNN block successively. Such fine-tuning gradually improves the accuracy of FER to a high level without the need to model a DCNN from scratch with random weights. The emotion recognition accuracy of the proposed FER system is tested on eight different pre-trained DCNN models (VGG-16, VGG-19, ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3, and DenseNet-161) and the well-known KDEF and JAFFE facial image datasets. FER on the KDEF dataset is more challenging due to the diversity in images with different profile views along with frontal views, and most of the existing studies considered a selected set of frontal views only. The proposed method is found to show remarkable accuracy on both datasets with any pre-trained model. The evaluation results reveal the superiority of the proposed FER system over the existing ones in terms of emotion detection accuracy. Moreover, the achieved performance on the KDEF dataset with profile views is promising, as it clearly meets the proficiency required for real-life industry applications.
The main contributions of this study can be summarized as follows.
(i) Development of an efficient FER method using DCNN models, handling the challenges through TL. (ii) Introduction of a pipeline training strategy for gradual fine-tuning of the model up to high recognition accuracy. (iii) Investigation of the model with eight popular pre-trained DCNN models on benchmark facial images with the frontal view and the profile view (where only one eye, one ear, and one side of the face are visible). (iv) Comparison of the emotion recognition accuracy of the proposed method with the existing methods, and exploration of the proficiency of the method, especially with profile views, which is important for practical use.
The rest of the paper is organized as follows: Section 2 briefly reviews the existing FER methods. Section 3 gives a brief overview of CNN, DCNN models, and TL for a better understanding of the proposed FER. Section 4 explains the proposed TL-based FER. Section 5 presents experimental studies. Section 6 gives an overall discussion on model significance, outcomes on benchmark datasets and related issues. Section 7 concludes the paper with a discussion on future research directions.

Related Works
Several techniques have been investigated for FER in the last few decades. The conventional pioneering methods first extract features from the facial image and then classify emotion from the feature values. On the other hand, recent deep learning-based methods perform the FER task by combining both steps in a single composite operational process. A number of studies reviewed and compared the existing FER methods [17,18,41,42], and the recent ones among them [41,42] included the deep learning-based methods. The following subsections briefly describe the techniques employed in the prominent FER methods.

Machine Learning-Based FER Approaches
Automatic FER is a challenging task in the artificial intelligence (AI) domain, especially in its machine learning subdomain. Different traditional machine learning methods (e.g., K-nearest neighbor, neural networks) have been employed throughout the evolution of the FER task. The pioneering FER method by Xiao-Xu and Wei [43] first added a wavelet energy feature (WEF) to the facial image, then used Fisher's linear discriminant (FLD) to extract features, and finally classified emotion using the K-nearest neighbor (KNN) method. KNN was also used for classification in FER by Zhao et al. [44], but they used principal component analysis (PCA) and non-negative matrix factorization (NMF) for feature extraction. Feng et al. [45] extracted local binary pattern (LBP) histograms from different small regions of the image, combined them into a single feature histogram, and finally used a linear programming (LP) technique to classify emotion. Zhi and Ruan [46] derived facial feature vectors from 2D discriminant locality preserving projections. Lee et al. [47] extended the wavelet transform to 2D, called the contourlet transform (CT), for feature extraction from the image and used a boosting algorithm for classification. Chang and Huang [48] incorporated face recognition into FER for better expression recognition of individuals, and they used a radial basis function (RBF) neural network for classification.
A number of methods used the support vector machine (SVM) to classify emotion from feature values extracted using distinct techniques. In this category, Shih et al. [49] investigated various feature representations (e.g., DWT, PCA), and DWT with 2D linear discriminant analysis (LDA) was shown to outperform the others. Shan et al. [50] evaluated different facial representations based on local statistical features and LBPs with different variants of SVM in their comprehensive study. Jabid et al. [51] investigated an appearance-based technique called the local directional pattern (LDP) for feature extraction. Recently, Alshami et al. [35] investigated two feature descriptors, called the facial landmarks descriptor and the center of gravity descriptor, with SVM. The comparative study of Liew and Yairi [17] considered SVM and several other methods (e.g., KNN, LDA) for classification on features extracted employing different methods, including Gabor, Haar, and LBP. The most recent study by Joseph and Geetha [52] investigated different classification methods, namely logistic regression, LDA, KNN, classification and regression trees, naive Bayes, and SVM, on their proposed facial geometry-based feature extraction. The major limitation of the aforementioned conventional methods is that they only considered frontal views for FER, as features from frontal and profile views differ under traditional feature extraction methods.

Deep Learning-Based FER Approaches
The deep learning approach to FER is relatively new in machine learning, and hitherto several CNN-based studies have been reported in the literature. Zhao and Zhang [22] integrated a deep belief network (DBN) with an NN for FER, where the DBN is used for unsupervised feature learning and the NN is used for classification of the emotion features. Pranav et al. [26] considered a standard CNN architecture with two convolutional-pooling layers for FER on self-collected facial emotion images. Mollahosseini et al. [21] investigated a larger architecture, adding four inception layers to two convolutional-pooling layers. Pons and Masip [53] formed an ensemble of 72 CNNs, where individual CNNs were trained with different filter sizes in the convolutional layers or different numbers of neurons in the fully connected layers. Wen et al. [54] also considered an ensemble of CNNs, but they trained 100 CNNs, and the final model was built from a selected number of them. Ruiz-Garcia et al. [36] initialized the weights of a CNN with the encoder weights of a stacked convolutional auto-encoder and trained it with facial images; such CNN initialization is shown to outperform CNN with random initialization. Ding et al. [55] extended a deep face recognition architecture to FER and proposed an architecture called FaceNet2ExpNet. Further, FaceNet2ExpNet was extended by Li et al. [23] using transfer learning. Jain et al. [56] and the authors of [60] considered unlabeled data along with labeled data in their CNN-based methods. On the other hand, Porcu et al. [61] evaluated different data augmentation techniques, including synthetic images, to train a deep CNN, and a combination of synthetic images with other methods performed better for FER. The existing deep learning-based methods have also mainly considered frontal images, and most of the studies even excluded the profile view images of the dataset in the experiments to make the task easy [17,35,36,61].

Overview of CNN, Deep CNN Models and Transfer Learning (TL)
Pre-trained DCNN models and TL technique are the basis of this study. Several pre-trained DCNN models are investigated to identify the best-suited one for FER. The following subsections present an overview of CNN, considered DCNN models, and TL motivation to make the paper self-contained.

Convolutional Neural Network (CNN)
Due to its inherent structure, the CNN is the most suitable model for the image domain [20]. A CNN consists of an input layer, multiple convolutional-pooling hidden layers, and an output layer. Basically, convolution is a mathematical operation on two functions that produces a third function expressing a modified shape of one of them. The small-sized (e.g., 3 × 3, 5 × 5) kernel of a CNN slides through the image to find useful patterns in it through the convolution operation. Pooling is a form of non-linear downsampling: a pooling layer combines non-overlapping areas at one layer into a single value in the next layer. Figure 1 shows the generic architecture of a standard CNN with two convolutional-pooling layers. The 1st convolutional layer applies the convolution operation to the input image and generates the 1st convolved feature maps (CFMs), which are the input of the successive pooling operation. The 1st pooling operation produces the 1st subsampled feature maps (SFMs). After the 1st pooling, the 2nd convolutional-pooling layer operations are performed. Flattening the 2nd SFMs' values, the fully connected layer (i.e., dense layer) performs the final reasoning, where the neurons are connected to all activations in the previous layer. The final layer, also called the loss layer, specifies how training penalizes the deviation of the predicted output from the actual output. Such a CNN architecture is popular for pattern recognition from small-sized (e.g., 48 × 48) input images, such as handwritten numeral recognition, and a detailed description of CNNs is available in the existing studies [19,62].
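As a small illustration of the convolution and pooling operations described above, the sketch below computes the spatial size of a feature map layer by layer using the standard output-size formula. The kernel sizes are illustrative assumptions, not taken from the paper.

```python
# Sketch: spatial size of a feature map after a convolution or pooling
# layer, using the standard formula
#   out = floor((in - kernel + 2*padding) / stride) + 1
def out_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    return (in_size - kernel + 2 * padding) // stride + 1

# A 48x48 input through two convolutional-pooling stages, as in Figure 1
# (assumed 5x5 kernels without padding, 2x2 non-overlapping pooling):
size = 48
for _ in range(2):
    size = out_size(size, kernel=5)            # convolution: 48 -> 44, then 22 -> 18
    size = out_size(size, kernel=2, stride=2)  # pooling:     44 -> 22, then 18 -> 9
```

After the two stages, each 48 × 48 input is reduced to a 9 × 9 subsampled feature map, whose flattened values feed the dense layer.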

DCNN Models and TL Motivation
A DCNN has many hidden convolutional layers, and it takes high-dimensional images, making both the input handling and the training very challenging. Different DCNN models hold different significant arrangements of and connections between the convolutional layers [19]. The first model that obtained good accuracy on ImageNet was AlexNet [63], which uses five convolutional layers. ZFNet [64] is based on a similar idea but with fewer parameters and achieved an equivalent accuracy level; it replaced big kernels with smaller ones. While AlexNet used 15 million images for training, ZFNet used only 1.3 million images to get a similar result. Later, VGG-16 [29] introduced a deeper model of depth 16, with 13 convolutional layers and smaller kernels [65]. VGG-19 is another model in this category, with 16 convolutional layers.
An important concept employed by most of the later models is the skip connection, which was introduced in the residual neural network (ResNet) [33]. The basic idea of a skip connection is to take the input of a layer and add it to the output after some layers. This provides more information to the layer and helps to overcome the vanishing gradient problem. Currently, several ResNet models with different depths are available, for example, ResNet-18, 34, 50, and 152; a model has one fewer convolutional layer than the depth mentioned in its name (e.g., ResNet-50 has 49 convolutional layers).
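The skip connection idea can be sketched in a few lines; here `f` stands in for the stacked convolutional layers of a residual block (an illustrative placeholder, not ResNet's actual layers):

```python
import numpy as np

# Minimal sketch of a skip (residual) connection: the block's input is
# added to its output, so the identity path carries the signal (and the
# gradient) directly past the intermediate layers.
def residual_block(x, f):
    return f(x) + x  # identity shortcut

x = np.ones(4)
# even if f() nearly kills the signal, the shortcut preserves it:
y = residual_block(x, lambda v: 0.1 * v)  # y = 1.1 * x
```

Because the derivative of the identity path is 1, gradients reach early layers undiminished, which is what lets ResNets grow to depths of 50 or 152 layers.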
Along with the single skip connection, DenseNet [34] introduced dense skip connections among layers. This means each layer receives signals from all its previous layers, and the output of that layer is used by all the subsequent layers; the input of a single layer is the channel-wise concatenation of the feature maps of the previous layers. A traditional CNN with L layers has L direct connections; on the contrary, DenseNet has L(L + 1)/2 direct connections. Since each layer has direct access to its preceding layers, there is a lower information bottleneck in the network. Thus, the model becomes much thinner and more compact and yields high computational efficiency. DenseNet blocks are built up by concatenating feature maps, so the input to deeper layers grows extensively, incurring massive computation for those layers. DenseNet therefore uses relatively cheap 1 × 1 convolutions to reduce the number of channels, which also improves parameter efficiency. In addition, the non-linearity at the kth layer is calculated by concatenating the features of layers 0 to (k − 1) and applying a non-linear function to this feature map. There are several versions of this model; DenseNet-161 contains 157 convolutional layers in four modules.
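The dense connectivity described above can be made concrete with a short sketch; the growth-rate and channel numbers are illustrative assumptions in the spirit of DenseNet, not the exact DenseNet-161 configuration:

```python
# Sketch of DenseNet-style dense connectivity: each new layer contributes
# a fixed number of feature maps (the "growth rate"), and layer k sees the
# channel-wise concatenation of everything before it.
def dense_block_channels(in_channels: int, growth_rate: int, num_layers: int) -> int:
    # channels entering the layer after the block
    return in_channels + growth_rate * num_layers

def direct_connections(num_layers: int) -> int:
    # every layer connects to every later layer: L*(L+1)/2 connections
    return num_layers * (num_layers + 1) // 2

# e.g., a 6-layer block starting from 64 channels with growth rate 32
# ends with 64 + 6*32 = 256 channels and 21 direct connections.
```

The linear channel growth is why the 1 × 1 "bottleneck" convolutions mentioned above are needed before the expensive 3 × 3 convolutions in deeper layers.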
Inception [66] is another deep CNN model, built up using several modules. The basic idea behind inception is to try different filters in parallel and stack the modules up after adding non-linearity. This avoids committing to a fixed filter size and lets the network learn whatever combination of these filters it needs. The module uses 1 × 1 convolutions to shrink the number of channels and thereby reduce the computation cost. Besides stacking these inception modules, the network has some branch layers, which also predict the model output and give some prior indication of whether the model is over-fitting or under-fitting. There are several versions of the inception model; Inception-v3 contains 40 convolutional layers with several inception modules.
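The parallel-branch idea can be sketched as follows; the per-branch channel widths are illustrative assumptions, not the actual Inception-v3 configuration:

```python
# Sketch of the inception idea: run several filter sizes (e.g., 1x1, 3x3,
# 5x5, and a pooling branch) in parallel on the same input and concatenate
# their outputs along the channel axis, so the network need not commit to
# a single kernel size.
def inception_out_channels(branch_channels):
    # channel-wise concatenation simply sums the per-branch widths
    return sum(branch_channels)

# e.g., four branches producing 64, 128, 32, and 32 feature maps
channels = inception_out_channels([64, 128, 32, 32])  # 256 output channels
```

The 1 × 1 convolutions placed before the wider branches keep `branch_channels` small, which is where the computational saving comes from.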
Training any large DCNN model is a complex task, as the network has many parameters to tune. A massive network commonly requires relatively large training data, and training with a small or insufficient amount of data might result in over-fitting. For some tasks, it is difficult to get a sufficient amount of data for proper training of a DCNN, and a huge amount of data is not readily available in some cases. However, research has shown that TL [37,38] can be very useful to solve this issue. Basically, TL is the concept of reusing the knowledge representation learned from a different task in a similar application. It is reported that the TL technique works better when both tasks are similar [37,39]. Recently, TL has been investigated on tasks different from the one the model was trained on and is shown to achieve good results [67,68], which is the motivation behind this study.

Facial Emotion Recognition (FER) Using TL in Deep CNNs
FER using a pre-trained DCNN model through appropriate TL is the main contribution of this study. Mahendran and Vedaldi [69] visualized what CNN layers learn: the first layer of the CNN captures basic features like the edges and corners of an image; the next layer detects more complex features like textures or shapes; and the upper layers follow the same mechanism towards learning more complex patterns. As the basic features are similar in all images, the tasks of the lower layers of a DCNN for FER are identical to those for other image-based operations, such as classification. Since training a DCNN model from scratch (i.e., with randomly initialized weights) is a huge task, a DCNN model already trained on another task can be fine-tuned for emotion recognition employing the TL approach. A DCNN model (e.g., VGG-16) pre-trained with a large dataset (e.g., ImageNet) for image classification [29] is suitable for FER. The following subsections describe TL concepts for FER and the proposed FER method in detail with the required illustrations. Figure 2 shows the general architecture of a TL-based DCNN model for FER, where the convolutional base is the part of the pre-trained DCNN excluding its own classifier, and the classifier on the base consists of the newly added layers for FER. As a whole, repurposing a pre-trained DCNN comprises two steps: replacement of the original classifier with a new one and fine-tuning of the model. The added classifier part is generally one or more fully connected dense layers. From a practical point of view, both selecting a pre-trained model and determining a size-similarity matrix for fine-tuning are important in TL [40,70]. There are three widely used strategies for training the model in fine-tuning: train the entire model, train some layers leaving others frozen, and train the classifier only (i.e., freeze the convolutional base) [71]. In the case of a similar task, training only the classifier and/or a few layers is enough in fine-tuning for learning the task. On the other hand, for dissimilar tasks, full model training is essential. Thus, fine-tuning is performed on the added classifier and a selected portion (or all) of the convolutional base. Selecting a portion for fine-tuning and choosing appropriate training methods for fine-tuning are tedious jobs in achieving better FER; these tasks are managed in this study through a pipeline strategy.
Figure 3 illustrates the proposed FER system with VGG-16, the well-known pre-trained DCNN model. The available VGG-16 model is trained with the ImageNet dataset to classify 1000 image objects.
The pre-trained model is modified for emotion recognition by redefining the dense layers, and then fine-tuning is performed with emotion data. In defining the architecture, the last dense layers of the pre-trained model are replaced with new dense layer(s) to recognize a facial image as one of seven emotion classes (i.e., afraid, angry, disgusted, sad, happy, surprised, and neutral). A dense layer is a regular, fully connected, linear layer of an NN that takes some dimension as input and outputs a vector of the desired dimension. Therefore, the output layer contains only seven neurons. The fine-tuning is performed on the architecture having the convolutional base of the pre-trained model plus the added dense layer(s). A cleaned emotion dataset prepared through preprocessing (i.e., resizing, cropping, and other tasks) is used for training in fine-tuning. In the case of testing, a cropped image is placed at the input of the system, and the emotion with the highest output probability is taken as the decision. VGG-16 may be replaced with any other DCNN model, e.g., ResNet, DenseNet, or Inception. Therefore, the size of the proposed model depends on the size of the pre-trained model used and the architecture of the added dense layer(s). Figure 4 shows the detailed schematic architecture of the proposed model with a detailed pre-trained VGG-16 model plus dense layers for FER. The green section in the figure is the added portion, having three fully connected layers in a cascade fashion. The first one is a 'Flatten' layer, which converts the matrix into a one-dimensional vector; its task is representational only, making the data compatible with the emotion recognition operation in the next layers, and no operation is performed on the data. The other two layers are densely connected: the first one is a hidden layer that converts the comparatively higher-dimensional vector into an intermediate-length vector, which is the input of the final layer. The final layer's output is a vector with the size of the number of emotional states.
The full model, in the same pipeline of the pre-trained DCNN and the added dense layers, gives the opportunity for fine-tuning the dense layers and the required few layers of the pre-trained model with emotion data. The pre-trained VGG-16 model shown in Figure 4 consists of five convolutional blocks, and each block has two or three convolutional layers and a pooling layer. The 2D convolutional and pooling operations indicate that the procedures are performed in the 2D image format. Conv Block 1 (i.e., the first block) has two convolutional layers and a MaxPooling layer in a cascade fashion. The output of this block is the input of Conv Block 2. Suppose the first convolutional layer of Block 1 takes inputs of size 224 × 224 × 3 for an input color image of size 224 × 224; after the successive convolution and pooling operations in the different blocks, the output size of the VGG-16 model is 7 × 7 × 512. The flatten layer converts it into a linear vector of size 25,088 (=7 × 7 × 512), which is the input to the first dense layer. This layer performs a linear operation and outputs a vector of length 1000, which is the input to the next dense layer of length 128. The output of the final dense layer is seven, for the seven different emotional expressions.
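The size trace quoted above can be verified with a short sketch; it assumes the standard VGG-16 configuration (3 × 3 convolutions with padding 1, so only the 2 × 2 pooling at the end of each block changes the spatial size):

```python
# Sketch tracing the VGG-16 sizes quoted in the text: five convolutional
# blocks, each ending in a 2x2 max pooling that halves the spatial size,
# with the standard VGG-16 channel widths per block.
spatial = 224
widths = [64, 128, 256, 512, 512]
for w in widths:       # convolutions preserve the size (3x3, padding 1)
    spatial //= 2      # each block's pooling halves it: 224 -> 112 -> ... -> 7
flat = spatial * spatial * widths[-1]  # flatten: 7 * 7 * 512 = 25088
```

The resulting 25,088-length vector is exactly the input size of the first added dense layer in Figure 4.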

Model Training

Fine-tuning is the most important step in the TL-based FER method, and a carefully selected technique is employed to fine-tune the proposed model in order to achieve an improved result. The added dense layer(s) always need to be fine-tuned, as the weights of each layer are randomly initialized. On the other hand, a part of (or the full) pre-trained model may be considered in the fine-tuning step. Conv Block 5 of the VGG-16 model, with the 'Fine-tune' mark in Figure 4, indicates that fine-tuning of this block is essential, while fine-tuning of the other four blocks is optional; thus, they are marked with 'frozen/fine-tune'. The training process of fine-tuning is also important. If the training of both the added layers and the pre-trained model is carried out together, the random weights of the dense layers will lead to a poor gradient, which will be propagated through the trained portion, causing deviation from a good result. Therefore, in fine-tuning, a pipeline strategy is employed to train the model with the facial emotion dataset: the added dense layers are trained first, and then the selected blocks of the pre-trained VGG-16 model are included in the training step by step. To train the four different layers of Conv Block 5 of the VGG-16 model, fine-tuning is extended gradually instead of training all of them at a time. This helps diminish the effect of the initial random weights and keeps track of accuracy. It is worth mentioning that the best outcome with a particular DCNN model may require different fine-tuning options for different datasets, depending on their size and other features.
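The staged unfreezing described above can be sketched as a simple schedule (a toy illustration; the component names are placeholders, not the actual layer names of any framework):

```python
# Pipeline fine-tuning schedule: stage 0 trains only the new dense layers;
# each later stage unfreezes one more pre-trained block, deepest first
# (block5 before block4, and so on), so early-stage random-weight gradients
# never reach the pre-trained weights.
BLOCKS = ["block1", "block2", "block3", "block4", "block5", "dense"]

def trainable_at(stage):
    """Return the set of components whose weights are updated at a stage."""
    n_blocks = len(BLOCKS) - 1            # number of pre-trained conv blocks
    unfrozen = min(stage, n_blocks)       # how many blocks are unfrozen so far
    return set(BLOCKS[n_blocks - unfrozen:])

# stage 0 -> {'dense'}; stage 1 -> {'block5', 'dense'}; stage 5 -> whole model
```

In a real run, each stage would be trained for some iterations before the next block is unfrozen, which is exactly the pipeline order described in the text.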
The Adam [72] algorithm, a popular optimization algorithm in computer vision and natural language processing applications, is used in the fine-tuning training process. Adam is derived from two optimization methods: adaptive gradient descent (AdaGrad), which maintains a different learning rate for different parameters of the model [73], and root mean square propagation (RMSProp) [74], which also maintains per-parameter learning rates based on the average of recent gradient magnitudes. Adam has two momentum parameters, beta1 and beta2, and the user-defined values of these parameters control the learning rate of the algorithm during training. A detailed description of the Adam approach is available in [72].
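A minimal scalar version of the Adam update (after [72]) shows how beta1 and beta2 combine momentum with RMSProp-style scaling; this is a didactic sketch, not the training code used in the paper:

```python
# Scalar Adam: m tracks the first moment (momentum, controlled by beta1),
# v tracks the second moment (RMSProp-like scaling, controlled by beta2),
# and both are bias-corrected before the update.
def adam_minimize(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
```

On this toy quadratic, the iterate settles near the minimizer w = 3.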
Image cropping and data augmentation are considered for training the proposed model. The cropped face portion of the image is used as the input in the FER task to enhance facial properties. Data augmentation, in turn, is an effective technique, especially for image data, of making new data from the available data. In this case, new data is generated by rotating, shifting, or flipping the original image. The idea is that if we rotate, shift, scale, or flip the original image, it will still show the same subject, but the image will not be identical to the original. The process is embedded in the data loader during training: every time an image is loaded from memory, a small transformation is applied to generate slightly different data. As the exact same data is never given to the model, the model is less prone to overfitting. This is very helpful, especially when the dataset is not very large, as is the case in FER. With this augmentation, the new cost function of the FER model considering all images is

E = (1 / (N · T)) Σ_{n=1}^{N} Σ_{t=1}^{T} L( f(τ_t(x_n)), y_n ),

where N represents the number of images in the dataset, T is the number of transformations performed over an image, τ_t denotes the t-th transformation, f is the model, and L is the per-sample loss of the prediction f(τ_t(x_n)) against the label y_n.
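The loader-embedded augmentation described above can be illustrated with a toy sketch (tiny nested-list "images" and trivial transforms; this mirrors the behaviour, not the actual loader):

```python
import random

# Each time an image is drawn, one small random transform is applied,
# so the model rarely sees the exact same array twice.
def hflip(img):
    return [row[::-1] for row in img]          # horizontal flip

def shift_right(img):
    return [[row[0]] + row[:-1] for row in img]  # one-pixel shift, edge-padded

def augment(img, rng):
    transform = rng.choice([hflip, shift_right, lambda x: x])
    return transform(img)

img = [[1, 2, 3],
       [4, 5, 6]]
rng = random.Random(0)
sample = augment(img, rng)                     # a slightly different copy of img
```

Because the transforms are drawn per access, N images effectively become N × T distinct training samples over T transformations, which is what the cost function above averages over.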

Experimental Studies
This section investigates the efficiency of the proposed FER system using TL on DCNN on two benchmark datasets. First, a description of the benchmark datasets and the experimental setup is presented. Then, the outcome of the proposed model on the benchmark datasets is compared with some existing methods to verify its effectiveness.

Benchmark Datasets
There are only a few datasets available for the emotion recognition problem; among them, the Karolinska Directed Emotional Faces (KDEF) [75] and Japanese Female Facial Expression (JAFFE) [76] datasets are well known and are considered in this study. Images in the datasets are categorized into seven different emotion classes: Afraid (AF), Angry (AN), Disgusted (DI), Sad (SA), Happy (HA), Surprised (SU), and Neutral (NE). A brief description of the datasets and the motivation for selecting them are given below.
The KDEF [75] dataset (also referred to as KDEF for simplicity, henceforth) was developed by the Karolinska Institute, Department of Clinical Neuroscience, Section of Psychology, Stockholm, Sweden. The dataset images were collected in a lab environment, so the emotions of the participants were artificially created. Specifically, the dataset was intended for use in perception, memory, emotion, attention, and backward-masking experiments. Although the primary goal of the material was not emotion classification, it is popular for such a task because medical and psychological issues are sometimes related to emotion. The dataset contains 4900 images of 70 individuals, each expressing seven emotional states. Photos of an individual were taken from five different angles, which correspond to the frontal (i.e., straight) view and four different profile views (full left, half left, full right, and half right). In terms of angular variation, the images range from −90° (full left) to +90° (full right). In a full left or full right profile view, only one side of the face, with only one eye and ear, is visible, which makes FER more challenging. Some sample images from the KDEF dataset are shown in Figure 5. FER on this dataset is challenging due to the diversity of images with different profile views along with the frontal view. Profile views mimic the expectation of FER from different angular positions; therefore, the complete dataset is considered in this study to evaluate the efficiency of the proposed method for such critical cases, which is necessary for industrial applications. Moreover, only a few studies are available for this dataset, and they are mostly based on the 980 frontal images (e.g., [17,33,60]).
The JAFFE [76] dataset (or JAFFE for simplicity) contains images of Japanese female models that were taken at the Psychology Department of Kyushu University. The dataset was also collected in a controlled environment for producing facial expressions.
Moreover, this dataset contains local facial variation. The JAFFE dataset is comparatively small, with only 213 frontal images of 10 individuals; some sample images from it are shown in Figure 6. This dataset was chosen to see how a small dataset responds to training the model. Moreover, a large number of studies have used the JAFFE dataset to evaluate FER models (e.g., [45,46,51]).




Experimental Setup
OpenCV [77] is used in this study for cropping the face. The images were resized to 224 × 224, which is the default input size of the pre-trained DCNN models. The parameters of the Adam optimizer were set to learning rate: 0.0005, beta1: 0.9, and beta2: 0.009. On the other hand, only a small amount of augmentation was carefully applied to the data, with the following settings: rotation (−10° to +10°), scaling factor ×1.1, and horizontal flip. Applying such small transformations to the original images is shown to improve the accuracy.
Experiments were conducted by separating the training and test sets in two different modes: (i) 90% of the available images of a benchmark dataset (KDEF or JAFFE) are randomly used as the training set, and the remaining 10% of images are reserved as the test set; and (ii) 10-Fold Cross-Validation (CV). In 10-Fold CV, the available images are divided into ten equal (or nearly equal) sets, and the outcome is the average of ten individual runs, where each time a particular set is used as the test set and the remaining nine sets are used for training. Since the aim of any recognition system is to respond properly to unseen data, the test set accuracy is considered as the performance measure. We trained the model in Python with Keras [78] and a TensorFlow backend. The experiments were conducted on a PC with a 3.5 GHz CPU and 16 GB of RAM in the Windows environment.
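The "ten equal (or nearly equal) sets" split can be sketched in a few lines (an illustration of the protocol, not the evaluation code used in the study):

```python
# Split n_samples indices into k contiguous folds whose sizes differ by at
# most one; each fold serves once as the test set while the other k-1 train.
def kfold_indices(n_samples, k=10):
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(213, k=10)                # e.g., the 213 JAFFE images
```

For JAFFE's 213 images this yields three folds of 22 and seven of 21, so every image is tested exactly once across the ten runs.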

Experimental Results and Analysis
This section investigates the efficiency of the proposed model on the benchmark datasets. Since CNN is the building block of the proposed model, a set of comprehensive experiments is first conducted with a standard CNN to identify the baseline performance. Then, the effects of different fine-tuning modes are investigated with VGG-16. Finally, the performance of the proposed model is tested with different pre-trained DCNN models.

Table 1 presents the test set accuracies of a standard CNN with two layers, 3 × 3 kernels, and 2 × 2 MaxPooling for various input sizes from 360 × 360 down to 48 × 48 on both the KDEF and JAFFE datasets. The test set was a randomly selected 10% of the available data. The presented results for a particular setting are the best test set accuracies over a total of 50 iterations. It is observed from the table that a larger input image size tends to give better accuracy, up to a maximum, for both datasets. As an example, the achieved accuracy on KDEF is 73.87% for an input image size of 360 × 360, whereas the accuracy is 61.63% for an input size of 48 × 48 on the same dataset. A bigger image carries more information, so a system should do well in classifying larger images. However, the best accuracy was not achieved at the biggest input size (i.e., 360 × 360). Rather, the best accuracies for both datasets were achieved for an image size of 128 × 128. The reason is that the model is suited to input data of that size, while a larger input size requires more data as well as a larger model to get better performance. The motivation behind the proposed approach is to use deeper CNN models with TL on a pre-trained model, which minimizes overfitting when training with a small dataset. As fine-tuning and its mode are important in the proposed TL-based system, experiments were conducted with different fine-tuning modes for better understanding.
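Why a larger input demands a larger model can be seen from simple size arithmetic for the two-layer baseline (assuming 'same'-padded 3 × 3 convolutions with one 2 × 2 MaxPooling after each conv layer; the 32/64 filter counts are illustrative assumptions, not taken from the paper):

```python
# Flattened feature-vector size of the two-layer baseline CNN for a given
# input size: two pooling stages quarter the spatial resolution, and the
# result is multiplied by the (assumed) 64 filters of the second layer.
def flatten_size(input_size, filters=(32, 64)):
    size = input_size
    for _ in filters:
        size //= 2                  # one 2x2 MaxPooling per conv layer
    return size * size * filters[-1]

small, large = flatten_size(48), flatten_size(360)   # 9,216 vs 518,400
```

Under these assumptions, a 360 × 360 input produces a flattened vector more than fifty times larger than a 48 × 48 input, inflating the first dense layer's parameter count accordingly; a fixed-capacity shallow model therefore cannot exploit the biggest inputs.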
Table 2 presents the test set accuracies (for a randomly selected 10% of the data) of the proposed model with VGG-16 for the different fine-tuning modes on the KDEF and JAFFE datasets.
The total training iterations were 50 in every fine-tuning mode; in the case of the entire model, the dense layers are trained for 10 iterations first, and the remaining 40 iterations are distributed over the successive addition of the Conv blocks of the pre-trained base. For a better understanding of the effectiveness of the proposed TL-based approach, results are also presented for training the whole model from scratch (i.e., with randomly initialized weights) for 50 iterations. From the table, it is observed that fine-tuning the dense layers and VGG-16 Block 5 is much better than fine-tuning the dense layers only, which indicates that fine-tuning the last block (here, Block 5) of the pre-trained model is essential. Again, considering the full VGG-16 base (i.e., the entire model) in fine-tuning is better than considering Block 5 only. The best achieved accuracies for the KDEF and JAFFE datasets are 93.47% and 100%, respectively. On the other hand, training the whole model from scratch shows very low accuracy with respect to the TL-based fine-tuning modes; the achieved test set accuracies are 23.35% and 37.82% for the KDEF and JAFFE datasets, respectively. The results presented in the table clearly reveal the proficiency of the proposed TL approach as well as the effectiveness of fine-tuning a portion of the pre-trained DCNN model.

To identify the best-suited model, the proposed approach is investigated for eight different pre-trained DCNN models: VGG-16, VGG-19-BN, ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3, and DenseNet-161. The number in a model's name represents the depth of that model; therefore, the selected models are diverse, with varying depths. The experiments were conducted both with randomly selected 10% test data (i.e., 90% as the training set) and with 10-Fold CV; the test set accuracies after 50 fine-tuning iterations are presented in Table 3.
For the JAFFE dataset, 100% accuracy was achieved by all the models in the selected 10% test data case, and the accuracy varied from 97.62% to 99.52% in the 10-Fold CV case. On the other hand, for the KDEF dataset, the accuracy varied from 93.47% to 98.78% on the 10% selected test data and from 93.02% to 96.51% in the 10-Fold CV case. The KDEF dataset is much larger than JAFFE and also contains profile-view images; therefore, a slightly lower accuracy than on JAFFE is logical. It is notable from Table 3 that relatively deeper models performed better. For example, ResNet-152 is always better than its shallower counterpart ResNet-18; the models achieved test set accuracies on KDEF in the 10-Fold CV case of 96.18% and 93.98%, respectively. Among the considered pre-trained DCNN models, DenseNet-161 is the deepest and outperformed the other models on both datasets. On the randomly selected 490 (i.e., 10%) test samples of KDEF, the model misclassified only six samples (an accuracy of (490 − 6)/490 = 98.78%), and it misclassified only one sample on JAFFE. Table 4 shows the emotion category-wise classification of the 490 (=7 × 70) test images of the KDEF dataset by DenseNet-161. Three afraid images are misclassified as surprised; two surprised images are misclassified as afraid; and one sad image is misclassified as disgusted. Table 5 shows the images that were misclassified by DenseNet-161, which are analyzed for a better realization of the proficiency of the proposed approach. All six misclassified images from KDEF are profile views, and three of them (sl. 2, 3, and 4) are full right views, which complicated the recognition. For the first and the second images from KDEF, the original expression is afraid but is predicted as surprised. In both images, the mouth is well extended and open, just like on a surprised face. In addition, the widened eyes are a feature of surprise. It is difficult even for humans to identify the expression of afraid from these facial images.
Alternatively, the misclassification of the third image as disgust is logical, as it is a profile view (i.e., only one side of the face is visible), so the expression appears to be disgust. The eyes in this image look shrunken, which is similar to disgust. Though the mouth is too exaggerated to be classed as sad, the expression is almost indistinguishable by an algorithm as well as by humans. The remaining three KDEF images share similar cases of misclassification. Finally, the only misclassified image from JAFFE is also difficult to recognize as afraid, as the open mouth makes it look surprised.
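The per-category counts reported above for DenseNet-161 on the 490 KDEF test images can be checked with a short script (only the error counts from Table 4 are needed; everything else is on the diagonal):

```python
# 70 test images per class, 7 classes; the errors dict records the
# off-diagonal confusion-matrix entries (true class, predicted class).
CLASSES = ["AF", "AN", "DI", "HA", "NE", "SA", "SU"]

def accuracy(per_class_total, errors):
    total = per_class_total * len(CLASSES)
    wrong = sum(errors.values())
    return (total - wrong) / total

errors = {("AF", "SU"): 3,   # afraid predicted as surprised
          ("SU", "AF"): 2,   # surprised predicted as afraid
          ("SA", "DI"): 1}   # sad predicted as disgusted
acc = accuracy(70, errors)   # (490 - 6) / 490
```

This confirms the reported 98.78% test set accuracy on KDEF.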

Results Comparison with Existing Methods
This section compares the performance of the proposed FER method with prominent existing methods on emotion recognition using the KDEF and JAFFE datasets. Along with the test set recognition accuracy, the training and test data separation and the distinguishing properties of the individual methods are also presented in Table 6 for better understanding. Both classical methods and deep learning-based methods are included in the analysis. Most of the existing methods use the JAFFE dataset; the dataset is relatively small, with only 213 samples, and a few methods considered 210 samples. On the other hand, the KDEF dataset is relatively large, with 4900 images containing both frontal and profile views. Only a few recent studies used this dataset, and they selected only the 980 frontal images [17,34,35], or an even smaller number of images [52], rather than the complete dataset. It is noteworthy that images with only frontal views are easier to classify than images with both frontal and profile views. Different strategies for separating training and test samples were used in the existing studies, as listed in the table. Moreover, each individual method's significance, together with the techniques used in feature selection and classification, is presented in the comparison table, which helps in understanding the proficiency of the techniques. The proposed method with DenseNet-161 is considered for the performance comparison in Table 6, as it showed the best accuracy (in Table 3) among the eight considered DCNN models. It is observed from Table 6 that the proposed method outperformed every conventional feature-based method on both the KDEF and JAFFE datasets. For JAFFE, among the feature-based methods, the pioneering work incorporating face recognition [48] is still shown to have achieved the best recognition accuracy, 98.98%, for an equal training and test set division.
In the 10-Fold CV case, the method with feature representation using DWT with 2D-LDA and classification using SVM [49] shows the best accuracy of 95.70%. In contrast, the proposed method achieved an accuracy of 99.52% in 10-Fold CV on JAFFE, which is much better than any other feature-based method; moreover, the accuracy is 100% on the randomly selected 10% test samples. Regarding the KDEF dataset, the proposed method achieved an accuracy of 98.78% (on randomly selected 10% test samples) considering all 4900 samples and outperformed the existing methods. Notably, the accuracy on 10% test samples is 82.40% in [17], which considered only the selected 980 frontal images, and this efficiency is inferior to the proposed method.
Since the proposed FER is based on a DCNN model through TL, comparing its performance with other deep learning methods, especially CNN-based methods, is more appropriate. The work with SCAE plus CNN [36] shows an accuracy of 92.52% on the KDEF dataset while considering frontal images only. The hybrid CNN and RNN method [56] shows an accuracy of 94.91% on the JAFFE dataset. On the other hand, an accuracy of 98.63% is shown by the method with facial image enhancement using the contrast-limited adaptive histogram equalization (CLAHE) algorithm, feature extraction using DWT, and classification using CNN [24]. According to the achieved performance, the proposed method outperformed all other deep learning-based methods, revealing the effectiveness of the proposed TL-based approach for FER.

Discussion
Emotion recognition from facial images in an uncontrolled environment (e.g., public places), where it is not always possible to acquire frontal-view images, is becoming important nowadays for a secure and safe life, smart living, and a smart society. Towards this goal, a robust FER system is essential, in which emotion recognition from diverse facial views, especially views from various angles, is possible. Profile views from various angles do not show the landmark features of the frontal view, and traditional feature extraction methods are unable to extract facial expression features from them. Therefore, FER from high-resolution facial images using a DCNN model is considered the only viable option for addressing such a challenging task. The TL-based approach is considered in the proposed FER system: a pre-trained DCNN is made compatible with FER by replacing its upper layers with dense layer(s), and the model is fine-tuned with facial emotion data. The pipeline training strategy in fine-tuning is the distinguishing feature of the proposed method: the dense layers are tuned first, followed by tuning the other DCNN blocks successively.
The proposed method has shown remarkable performance in the evaluation on the benchmark datasets with both frontal and profile views. The JAFFE dataset contains frontal views only, and the KDEF dataset contains profile views taken from four different angles along with frontal views. In the full left/right profile views of KDEF, only one side of the face, with only one eye and ear, is visible; thus, the recognition task becomes complex. We feel that the experiments on these two diverse datasets are adequate to justify the method's proficiency, and the proposed method is expected to perform well on other datasets. However, datasets with low-resolution images or highly imbalanced classes will need additional preprocessing and appropriate modification of the method, which remains a subject for future study. Furthermore, working with images from uncontrolled environments or with video sequences also remains future work.
As generalization ability (performance on unseen data) is an essential attribute in the machine learning paradigm, the test set concept (samples that are not used in any training step) is used to validate the proposed model. Reserving a fixed number of samples from the available data and cross-validation (where every sample is reserved in some round) are the two popular ways to maintain a test set. Both methods were considered in the present study, although it is common to follow only one of them. The test set was used only for the final validation of the proposed model. The proposed method has outperformed the existing ones based on the achieved test set accuracies. It is noteworthy that the proposed method misclassified only a few images with confusing views, and the overall recognition accuracy remains remarkably high. Therefore, the method proposed in this paper is promising for practical scenarios where the classification of non-frontal or angularly taken images prevails.
The selection of parameter values is a fundamental task for any machine learning system. In the proposed FER model, only the upper dense layers of the pre-trained deep CNN are replaced by appropriate new layers. The hyperparameters (e.g., the number of dense layers, the neurons in each layer, and the fine-tuning learning parameters) were chosen based on several trials emphasizing the pipeline training issue. There is scope for further optimizing every parameter of a particular DCNN model for each dataset, which might enhance the performance of the proposed method.

Conclusions
In this study, an efficient DCNN using TL with a pipeline tuning strategy has been proposed for emotion recognition from facial images. According to the experimental results, using eight different pre-trained DCNN models on the well-known KDEF and JAFFE emotion datasets with different profile views, the proposed method shows very high recognition accuracy. In the present study, experiments were conducted with general settings regardless of the pre-trained DCNN model for simplicity, and a few confusing facial images, mostly profile views, were misclassified. Further tuning the hyperparameters of individual pre-trained models and paying special attention to profile views might enhance the classification accuracy. The current research, especially the performance with profile views, is compatible with broader real-life industry applications, such as monitoring patients in hospitals or surveillance security. Moreover, the idea of facial emotion recognition may be extended to emotion recognition from speech or body movements to cover emerging industrial applications.