Facial Expression Recognition Using Pre-trained Architectures

Abstract: Facial emotion recognition is among the most difficult and challenging tasks in computer vision, and facial expression recognition (FER) stands out as a pivotal research focus, with applications in various domains such as emotion analysis, mental health assessment, and human–computer interaction. In this study, we explore the effectiveness of ensemble methods that combine pre-trained deep learning architectures, specifically AlexNet, ResNet50, and Inception V3, to enhance FER performance on the FER2013 dataset. The results of this study offer insights into the potential advantages of ensemble-based approaches for FER, demonstrating that combining pre-trained architectures can yield superior recognition outcomes.


Introduction
Facial expression recognition (FER) is gaining increasing importance in computer vision. It is used for the analysis and classification of a given facial expression. FER can be applied in fields such as robotics, security, driving assistance, mental health disorder prediction, and lie detection [1,2]. With the progress of deep learning, FER technology has achieved a remarkable increase in recognition accuracy compared to traditional methods.
Because of their capacity to extract image information, Convolutional Neural Networks (CNNs) have been widely employed for image classification tasks, particularly in FER. However, there may be certain difficulties in training a CNN model for FER.
Overfitting on uncertain inputs, for example, may result in mislabeled outputs. Furthermore, a high percentage of inaccurate labels during the early stages of optimization can prevent the model from converging. Transfer learning is one of the deep learning methods that has gained much attention in the field. It employs a pre-trained CNN to solve a problem similar to the one the CNN was originally trained on. Pre-trained models are commonly utilized in FER research [3][4][5].

Related Works
In the last decade, researchers have turned from traditional machine learning to deep learning because of its strong automatic recognition capability. This section describes several known studies in FER utilizing deep learning.
Yolcu et al. [3] proposed detecting essential parts of the face using three CNNs of the same architecture to locate the eyebrows, mouth, and eyes. They developed a system for monitoring neurological disorders using facial expressions. The experiment conducted on the RaFD dataset reports an accuracy of 94.44%. Li et al. [4] introduced a new CNN for facial occlusion problems. The VGGNet network and a CNN called ACNN are trained on two databases, AffectNet and RAF-DB. Recognition accuracies of 80.54% and 54.84% are reported on RAF-DB and AffectNet, respectively. Zahara et al. [5] used a CNN running on a Raspberry Pi for facial expression recognition; the pipeline includes face detection, feature extraction, and face emotion recognition. The experiment reported a recognition accuracy of 65.97% on the FER2013 dataset. In [6], pre-trained architectures such as VGG, Inception, and ResNet are experimented with for emotion recognition on the FER2013 dataset. A maximum accuracy of 75.2% was reported on ensembles of modern deep CNNs. The pre-trained architectures GoogleNet and AlexNet were examined for emotion recognition in [7]; GoogleNet reported a maximum accuracy of 65.20% on the FER2013 dataset. Liu et al. [8] used several subnets, each of which is a compact CNN model, for emotion recognition. A single subnet achieved an accuracy of 62.44% on the FER2013 dataset.
An ensemble-based approach using AlexNet, ResNet, and VGG16 on the FER2013 dataset is employed in [9] to achieve an accuracy of 71.27%, with an SVM used for classification. Facial expression recognition using deep architectures is also presented in [10,11].

Pre-trained Architectures
AlexNet, a neural network architecture in the field of deep learning, marked a significant milestone in the advancement of computer vision tasks, particularly image classification. AlexNet's architecture consisted of eight layers, with five convolutional layers followed by three fully connected layers.
Google's Inception v3, a convolutional neural network (CNN), is an example of a sophisticated architecture created to tackle image classification problems. The inception modules form the foundation of the architecture of Inception v3. By using alternative kernel sizes for concurrent convolutional operations, these modules enable the network to simultaneously record features at various scales. Through the integration of multiple parallel pathways, Inception v3 improves its recognition of complex structures and patterns in images.
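The parallel-pathway idea can be illustrated with a minimal sketch. This is not the actual Inception v3 module (which uses learned convolutions, pooling paths, and dimension-reducing 1 × 1 convolutions); the "filters" here are simple mean filters, used only to show how outputs of several kernel sizes are computed on the same input and concatenated channel-wise:

```python
import numpy as np

# Illustrative sketch of an Inception-style block: run filters of
# different sizes on the same input in parallel, then concatenate
# the results along the channel axis so features at several scales
# are captured at once. Mean filters stand in for learned kernels.

def mean_filter(img, k):
    """Naive k x k mean filter with 'same' output size (zero padding)."""
    p = k // 2
    padded = np.pad(img, p)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def inception_like(img):
    """Parallel 1x1, 3x3, and 5x5 paths, concatenated channel-wise."""
    paths = [mean_filter(img, k) for k in (1, 3, 5)]
    return np.stack(paths, axis=-1)

img = np.random.default_rng(1).random((8, 8))
out = inception_like(img)
assert out.shape == (8, 8, 3)  # one output channel per parallel path
```

The key design point carried over from the real module is that all paths see the same input and their outputs are stacked, rather than being applied sequentially.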
The key innovation of ResNet50 lies in its approach to deep learning through residual connections. Unlike traditional architectures, ResNet introduces skip connections, or shortcuts, allowing the network to learn residual functions. This alleviates the vanishing gradient problem and facilitates the training of extremely deep networks by enabling the direct flow of information through the network.
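The residual connection described above can be sketched in a few lines. This is a toy stand-in, assuming a simple two-layer transformation F in place of ResNet50's convolutional blocks; the point is only that the block computes F(x) + x, so the identity path carries information even when F is untrained:

```python
import numpy as np

# Minimal sketch of a residual (skip) connection: the block's output
# is F(x) + x, so gradients and information can flow through the
# identity path even when the learned residual F is hard to train.

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two small linear layers."""
    f = relu(x @ w1) @ w2   # F(x): the learned residual function
    return relu(f + x)      # skip connection adds the input back

x = np.array([[1.0, -2.0, 3.0, -4.0]])
# With zero weights, F(x) = 0 and the block reduces to ReLU(x):
y = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
assert np.allclose(y, relu(x))  # the identity path survives an untrained F
```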

Methodology
This study uses the concept of transfer learning, in which a CNN trained for one application is reused for another. Building deep learning models from scratch requires substantial resources and a large amount of data; this cost can be minimized by using transfer learning. The CNN is tested in three ways. First, features extracted from all pre-trained models by removing the fully connected classification layers are fed to an SVM for classification. In the second approach, all other layers in the pre-trained networks are frozen, and only the SoftMax layer is replaced with a seven-class output layer. In the third approach, an ensemble-based approach using model averaging is used for prediction. The proposed methodology adopted in this work is shown in Figure 1.
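The third approach (model averaging) can be sketched as follows. The per-model probability vectors below are hypothetical values for illustration, assuming each pre-trained network ends in a seven-way softmax over the FER2013 emotion classes:

```python
import numpy as np

# Sketch of ensemble prediction by model averaging: average the
# softmax probability vectors from the individual networks, then
# take the argmax of the averaged distribution as the final label.

def ensemble_predict(prob_list):
    """Average class probabilities from several models, then argmax."""
    avg = np.mean(np.stack(prob_list), axis=0)
    return int(np.argmax(avg)), avg

# Hypothetical per-model softmax outputs for one image (7 classes):
alexnet_p   = np.array([0.10, 0.05, 0.50, 0.10, 0.05, 0.10, 0.10])
resnet50_p  = np.array([0.05, 0.10, 0.40, 0.20, 0.05, 0.10, 0.10])
inception_p = np.array([0.10, 0.10, 0.45, 0.15, 0.05, 0.05, 0.10])

label, avg = ensemble_predict([alexnet_p, resnet50_p, inception_p])
# class index 2 has the highest average probability in this toy case
```

Averaging probabilities (rather than hard labels) lets a confident model outvote two uncertain ones, which is one common motivation for this form of ensembling.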

Experiments
The model is implemented in Python, using the TensorFlow deep learning framework with Keras. The experiments are conducted on hardware equipped with an Intel Core i5 processor and 8 GB of RAM.

FER2013 (Facial Expression Recognition 2013), a publicly available facial expression dataset, is used to train the model [12]. The database exclusively comprises grayscale images, all size-normalized to 48 × 48 pixels. The dataset encompasses around 30,000 images, each associated with one of seven distinct emotions: happiness, anger, surprise, fear, disgust, sadness, and neutrality. The images include both posed and unposed headshots. Sample images from the FER2013 database are shown in Figure 2.

Face Detection Using Haar Cascade
Face detection using a Haar cascade is an efficient method proposed in [13] for detecting human faces in real-time video or images. A cascade function is trained on a large number of positive and negative images and is then used to detect faces. The algorithm uses edge and line features, as proposed in [13]. For validation, face detection is first performed on live video using the Haar cascade, and expressions are then evaluated on the detected face using the trained model.
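The edge and line features underlying the Haar cascade can be computed cheaply via an integral image, which is the core trick of the Viola-Jones detector [13]. The sketch below is a self-contained illustration of that idea, not the full cascade (no training, no sliding window, no stage rejection):

```python
import numpy as np

# Sketch of Haar-like feature evaluation: an integral image lets any
# rectangle sum be computed with at most four lookups, so edge/line
# features (a bright region minus an adjacent dark region) are cheap
# to evaluate densely across an image.

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def edge_feature(ii, r0, c0, h, w):
    """Two-rectangle (edge) feature: left half minus right half."""
    left = rect_sum(ii, r0, c0, r0 + h, c0 + w // 2)
    right = rect_sum(ii, r0, c0 + w // 2, r0 + h, c0 + w)
    return left - right

# Toy image: bright left half, dark right half -> strong edge response.
img = np.hstack([np.ones((4, 2)), np.zeros((4, 2))])
ii = integral_image(img)
print(edge_feature(ii, 0, 0, 4, 4))  # 8.0
```

In practice one would use a trained cascade from a library such as OpenCV rather than hand-built features; the sketch only shows why feature evaluation is fast enough for real-time video.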



Results and Discussion
Due to the small and unbalanced nature of the FER2013 dataset, the application of transfer learning can enhance the model's accuracy. Pre-trained models based on transfer learning were explored. Initially, a preprocessing step is implemented to resize all images in the FER2013 dataset and convert them to color images.
Each of the three models underwent a distinct rescaling procedure owing to variations in their required input image sizes. The AlexNet, ResNet50, and InceptionV3 models require input images sized at 227 × 227, 224 × 224, and 139 × 139 (Figure 3), respectively. To accommodate models that expect images with three input channels, each single-channel grayscale image is expanded by duplicating the grayscale information across the remaining two channels. This ensures that the image format aligns with the requirements of all three models. Following this, zero-mean normalization is applied to standardize inputs for each mini-batch, promoting a stable learning process in subsequent layers. Across all models, the stochastic gradient descent (SGD) optimizer was utilized, with a learning rate of 0.0001. Batch sizes were set to 128, and training extended to 100 epochs. Table 1 shows Rank-1 accuracy for the different models using data with and without augmentation. Figure 4 shows the output of expression recognition on sample inputs, and Figure 5 shows expression recognition using live video. Extensive research in deep learning has been conducted using the FER2013 dataset, and Table 2 compares the current study with previous research on the same dataset.
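The preprocessing steps just described (channel duplication, rescaling, and per-mini-batch zero-mean normalization) can be sketched as below. The target size 224 × 224 corresponds to ResNet50; AlexNet and InceptionV3 would use 227 and 139. Nearest-neighbour resizing is used here for brevity, where a real pipeline would typically use a library resize:

```python
import numpy as np

# Sketch of the preprocessing pipeline for one 48x48 grayscale
# FER2013 image: resize, duplicate the grayscale channel into three
# channels, and apply zero-mean normalization over a mini-batch.

def to_three_channels(gray):
    """Duplicate the single grayscale channel across three channels."""
    return np.repeat(gray[:, :, None], 3, axis=2)

def resize_nearest(img, size):
    """Nearest-neighbour resize of a square 2-D image (toy version)."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[rows][:, cols]

def zero_mean(batch):
    """Zero-mean normalization across a mini-batch (axis 0)."""
    return batch - batch.mean(axis=0, keepdims=True)

gray = np.random.default_rng(0).random((48, 48))
x = to_three_channels(resize_nearest(gray, 224))
assert x.shape == (224, 224, 3)       # ResNet50-shaped input

batch = np.stack([x, x * 0.5])
assert np.allclose(zero_mean(batch).mean(axis=0), 0.0)
```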
Table 2. Comparison of accuracy using the FER2013 dataset with existing methods.

Method      Accuracy Rate
CNN [8]     62.44%

Figure 4. Expression recognition on sample input images.

Figure 5. Expression recognition using live video.

Table 1. Accuracy using different methods.