1. Introduction
Facial expressions play a significant role in nonverbal communication. They can be categorised as nonverbal cues of a non-communicative nature: they occur naturally and reflect not only emotions but also mental activities, physical gestures, and social interactions [1]. Facial expression recognition (FER) is widely used in applications including customer satisfaction recognition, human–computer interaction, medical diagnostics (disease), elderly care, criminal justice systems, security monitoring, smart card applications, and improved law enforcement services in smart cities [2,3,4]. Vision sensor-based FER has attracted attention in current research and has great potential for real-time recognition. In vision-based FER, researchers have focused on seven basic expressions, i.e., anger, disgust, fear, happy, neutral, sad, and surprise [5,6], and have divided FER methods into two categories: conventional and deep learning (DL)-based methods.
Conventional FER systems comprise three major steps: face detection, feature extraction, and classification [7]. Several researchers have used conventional feature extraction mechanisms, such as clustering-based methods [8], Local Binary Patterns (LBP) [9], principal component analysis (PCA) [10], Histogram of Oriented Gradients (HoG) [11], Oriented FAST and Rotated BRIEF (ORB) [4], and feature-level fusion techniques [12]. The extracted features are then fed to classifiers such as K-nearest neighbours (KNN) [13], Hidden Markov Models (HMM) [14], Support Vector Machines (SVM) [15], decision trees, and Naïve Bayes [16]. In conventional vision-based FER, the major concern is that feature extraction and classification are performed independently: both steps require domain experts for prominent feature selection and are time-consuming and error-prone [17], which makes it challenging to improve the performance of conventional FER systems. Researchers have therefore investigated DL-based strategies for FER, which provide comparatively better accuracy.
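To make the conventional pipeline concrete, the following is a minimal sketch of hand-crafted feature extraction followed by a separate classifier. It assumes the basic 3×3 LBP operator and a 1-nearest-neighbour rule with an L1 distance; the function names `lbp_histogram` and `nearest_neighbour` are hypothetical and are not taken from any of the cited works.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 3x3 LBP codes.

    `image` is a 2D list of grey levels; border pixels are skipped.
    """
    # Offsets of the 8 neighbours, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    return hist


def nearest_neighbour(query, labelled_histograms):
    """Return the label of the stored histogram closest to `query` (L1 distance).

    `labelled_histograms` is a list of (label, histogram) pairs.
    """
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    return min(labelled_histograms, key=lambda item: l1(query, item[1]))[0]
```

The two functions are deliberately independent of each other, which mirrors the separation between feature extraction and classification that DL-based methods remove.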
Inspired by the recent performance of DL approaches, several researchers have used CNN-based methods in different domains, such as fire disaster [18], time-series analysis [19,20], medical image analysis [21], video analysis [22], photovoltaics [23], sentiment analysis from text data [24], and energy management [20,25,26], and have achieved promising results. In recent years, DL-based methods have outperformed conventional methods for FER by combining automatic feature extraction and classification into a single end-to-end step [27,28]. In particular, convolutional neural networks (CNNs) have been used in several research studies to address the limitations of conventional FER [29,30]. Researchers have applied different CNN models to FER, such as the ensemble convolutional neural network (ECNN) in [31], VGG [5,32], AlexNet [33], ResNet50 [34], and Xception [35]. These methods improve on conventional FER; however, accuracy needs further improvement, and time complexity, model size, inference speed, and performance on unseen data restrict these systems from real-world implementation. An efficient and accurate model has therefore yet to be developed.
In this paper, we propose an improved CNN-based architecture for FER that raises recognition performance and thereby increases the usability of FER in several applications, such as human–computer interaction, customer reviews, and elderly care, and especially in law enforcement services in smart cities. We used the Viola–Jones (VJ) face detection algorithm, which builds on considerable research into facial recognition and detection and can segment and recognise elements such as the nose, mouth, and eyes [36]. The detected faces were passed to our proposed model for FER. The proposed model is lightweight and can be deployed on a cost-effective, resource-constrained device. The performance of the proposed model was evaluated using three benchmark datasets to check the model’s effectiveness in a real-world environment. The key contributions of our work are summarised as follows:
- We propose an efficient framework for FER that can be deployed on resource-constrained hardware to identify and monitor suspicious activities, assisting law enforcement agencies in smart cities with a cost-effective solution for better security.
- The proposed framework is based on a lightweight CNN model for FER and uses the VJ algorithm for face detection. 
- We performed cross-validation of the proposed model to fully assess its generalisation abilities. 
- The performance of the proposed model is evaluated on several benchmark datasets and the results reveal significant improvements in accuracy compared to state-of-the-art approaches. 
The rest of the paper is organised as follows: Section 2 describes related work, Section 3 explains the methodology, Section 4 presents the results, and Section 5 concludes the paper.
  2. Related Work
FER is currently the subject of considerable active research, and several cutting-edge techniques have been proposed over the past two decades. However, because facial features vary widely across people of different ages, cultures, and genders, and across scales and perspectives, more accurate techniques are still required. The relevant literature includes numerous studies on the use of facial expressions to identify feelings and emotions, based on both conventional and DL-based methods. Conventional vision-based FER methods, however, have achieved limited performance in extracting features from input images and classifying them accordingly. For instance, Kumar et al. [8] proposed a three-tier framework for FER. In the first tier, Otsu’s thresholding approach is used to remove the background using the YCbCr colour scheme; in the second tier, the max–min algorithm is used to select the initial cluster values of the K-means algorithm, which segments the most important regions of the image (nose, mouth, forehead, eye gap, and eye); and in the third tier, different shape features are extracted from these segmented regions and fed into a two-level rule-based classifier for FER. Shan et al. [9] extracted LBP and discriminant LBP features from the input images and fed them to an SVM classifier for effective FER. Mansour et al. applied a PCA-based method for efficient FER [10], while Kumar et al. [11] presented a real-time system for FER. The authors used the HoG feature descriptor to extract the most prominent facial features and an SVM classifier to assign the extracted features to seven different emotions. Sajjad et al. [4] developed an FER-based system for detecting suspicious activity. The authors used the VJ algorithm for face detection; the detected face was then preprocessed with median and Gabor filters. ORB features were then extracted, and an SVM classifier was trained to classify the seven basic emotions. Wang et al. [14] used geometric LBP and Gabor feature descriptors for salient facial feature extraction; the extracted features were then classified by an HMM. Abdulrahman et al. [15] employed PCA and LBP for feature extraction and used an SVM classifier to assign the extracted features to the seven basic emotions. They used a VJ algorithm for face detection, a Supervised Descent Method to extract prominent features from the detected face, and a decision tree algorithm to classify the seven basic emotions. In conventional vision-based FER, the major concern is that feature extraction and classification are performed independently: both steps require domain experts for prominent feature selection and are time-consuming and error-prone.
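Several of the systems above rely on the VJ algorithm for face detection. As background, the sketch below illustrates the integral image and a two-rectangle Haar-like feature, the building blocks on which the VJ detector is based. It is an illustrative assumption rather than the detector used in any cited work: the cascade of boosted classifiers is omitted, and the function names are hypothetical.

```python
def integral_image(image):
    """Summed-area table with an extra leading row and column of zeros.

    ii[r][c] holds the sum of image[0:r][0:c].
    """
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle, using four table look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Because every rectangle sum costs four look-ups regardless of its size, many such features can be evaluated per window at low cost, which is why this style of detector suits resource-constrained hardware.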
DL-based methods have been applied to overcome the challenges of conventional vision-based FER systems. Numerous studies have examined FER, and some of the most recent work has focused on developing effective and efficient training models. For instance, Mayya et al. [37] presented an approach for FER using DCNN features. The authors employed a DCNN model pretrained on ImageNet [38] and obtained a 9216-dimensional feature vector, which was passed to an SVM to recognise the seven basic emotions. Sajjad et al. [30] proposed an FER method for behaviour analysis using videos from famous TV series. In this approach, the VJ algorithm is used for face detection, and then HoG, SVM, and a CNN model are used for feature extraction and classification. Al-Shabi et al. [7] used a hybrid model for FER that fuses SIFT and CNN features for facial feature analysis. Yu et al. [39] investigated the performance of CNNs for FER by employing an ensemble of CNNs with five convolutional layers. The authors used a stochastic pooling strategy instead of max pooling to achieve better performance. Jain et al. [40] proposed a new DL model containing deep residual blocks and convolution layers for accurate FER. DL models such as VGG, AlexNet, a four-layer CNN, ResNet, and MobileNet are used in [5,27,33,34,41,42] for accurate FER. However, time complexity, model size, inference speed, and performance on unseen data restrict these systems from real-world implementation; therefore, an efficient and accurate model has yet to be developed. The framework proposed in this paper draws on techniques that address concerns about increased computational costs and feature extraction from low-resolution images in poor-quality scenarios.
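The difference between the stochastic pooling and max pooling strategies mentioned above can be illustrated on a single pooling region. The sketch below assumes the commonly described formulation of stochastic pooling, in which one non-negative activation is sampled with probability proportional to its value during training and a probability-weighted average is used at inference. It is not the implementation of Yu et al., and the function names are hypothetical.

```python
import random


def max_pool(region):
    """Max pooling: keep the largest activation in the region."""
    return max(region)


def stochastic_pool(region, rng=random):
    """Training-time stochastic pooling over non-negative activations.

    One activation is sampled with probability proportional to its value.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # all activations are zero, nothing to sample
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_inference(region):
    """Inference-time variant: probability-weighted average of the region."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total
```

Unlike max pooling, the sampled variant occasionally passes on non-maximal activations, which acts as a form of regularisation.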
  4. Results and Discussion
Experimental results were obtained from three benchmark datasets: FER-2013, CK+, and KDEF. Each dataset was divided into training, testing, and validation subsets, with 60% of the data used for training, 20% for testing, and 20% for validation, following the splitting method of a state-of-the-art approach [46]. Before choosing these percentages, we also tested the proposed model on several other data splits and found that it could learn effectively from a smaller amount of data. We conducted experiments during the training and testing processes to observe how the performance of the proposed system changed. To obtain streams from VSN [22] for experimental evaluation, we used Python 3.64, OpenCV3+, Keras, and TensorFlow with resource-constrained devices. To train the model, we used a GTX 1070 GPU with 8 GB of onboard memory and an Intel Core i5 processor with 16 GB of RAM, on a system comprising 8 GB of memory, a 2.8 GHz processor, and a 1 terabyte (TB) installed hard drive. Table 2 shows the detailed specifications of the system and important libraries.
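The 60/20/20 division described above can be sketched as follows. This is a minimal illustration that assumes a simple shuffled split with a fixed seed; whether the original split was stratified by class is not stated, and the function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(samples, train=0.6, test=0.2, seed=0):
    """Shuffle samples and split them into training, testing, and validation parts.

    The validation part receives whatever remains after the first two parts.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = round(len(items) * train)
    n_test = round(len(items) * test)
    training = items[:n_train]
    testing = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return training, testing, validation
```

For a dataset of 100 samples, this yields 60 training, 20 testing, and 20 validation samples with no overlap between the three parts.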
  4.1. Experimental Evaluation
In this section, we evaluate the performance of the proposed model over benchmark datasets. The detailed experimental results for each dataset are described in the following subsections:
  4.1.1. Performance Evaluation over FER-2013
Figure 3 shows the accuracy of our proposed CNN model on FER-2013 during training and validation. The x-axis shows the number of epochs, while the y-axis shows the training and validation accuracy. We trained the model for 30 epochs. The validation accuracy of the proposed method on FER-2013 started at 0.2%, whereas the training accuracy started at 0.5%. After each epoch, the training and validation accuracy increased, and after several epochs, both became stable. After 30 epochs, the training and validation accuracy both reached 89%, indicating that the proposed model fits the variation in the data. The testing accuracy of each class was 77.78% for anger, 81.50% for disgust, 85.86% for fear, 93.33% for happy, 95% for neutral, 93% for sad, and 90.44% for surprise. The overall testing accuracy of our model on FER-2013 was 89%, and the confusion matrix for all classes is given in Figure 4.
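Per-class and overall accuracies of the kind reported above are typically read from a confusion matrix. The sketch below shows one way to compute them, assuming that rows correspond to true classes and columns to predicted classes; the two-class matrix in the usage note is a toy example, not data from the paper.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class that was predicted correctly.

    Rows are true classes, columns are predicted classes (an assumption).
    """
    result = []
    for i, row in enumerate(confusion):
        total = sum(row)
        result.append(row[i] / total if total else 0.0)
    return result


def overall_accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0
```

For the toy matrix `[[8, 2], [1, 9]]`, the per-class accuracies are 0.8 and 0.9 and the overall accuracy is 0.85. When classes are unbalanced, the overall accuracy need not equal the mean of the per-class values.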
   4.1.2. Cross-Validation of the FER-2013 Trained Model over the KDEF and CK+ Datasets
We also evaluated the performance of the proposed model using cross-corpus evaluation, where the model is trained on the FER-2013 dataset and validated on the KDEF and CK+ datasets. Table 3 and Table 4 present the detailed results. Cross-corpus validation was performed on a test set to check the generalisation ability of the proposed model on unseen data.
  4.1.3. Performance Evaluation over CK+
To evaluate the performance of the proposed model on the CK+ dataset, we used the same number of epochs as for the FER-2013 dataset. The training and validation accuracy of our model on the CK+ dataset started at 71% and 90%, as shown in Figure 5, and reached 92% and 89%, respectively. The testing accuracy is shown in the confusion matrix in Figure 6, which indicates that the overall testing accuracy of the proposed system on CK+ is 90.98%. The proposed model achieved 77.57%, 85%, 88%, 98.31%, 99%, 99%, and 90% for the anger, disgust, fear, happy, neutral, sad, and surprise classes, respectively.
  4.1.4. Cross-Validation of the CK+-Trained Model over the FER-2013 and KDEF Datasets
The model trained on the CK+ dataset was cross-validated on the FER-2013 and KDEF datasets. We selected test samples from each class of these datasets and performed cross-validation to verify the robustness of the model. Table 5 and Table 6 present the results.
  4.1.5. Performance Evaluation over KDEF
Figure 7 illustrates the results on the KDEF dataset in terms of training and validation accuracy, which started at 0.5%. After each epoch, the accuracy gradually increased as the model’s parameters were learned. Finally, after 30 epochs, the training and validation accuracy reached 94% and 93%, respectively. The KDEF confusion matrix indicated that the overall testing accuracy of the proposed model was 94.04%. Anger was identified with an accuracy of 91.89%, disgust 91.50%, fear 92%, happy 97%, neutral 95.89%, sad 94.28%, and surprise 95.78%, as shown in Figure 8.
   4.1.6. Cross-Validation of the KDEF-Trained Model over the FER-2013 and CK+ Datasets
The model trained on the KDEF dataset was cross-validated on the FER-2013 and CK+ datasets, and the detailed results are given in Table 7 and Table 8. Cross-corpus validation was performed on a test set to check the generalisation ability of the proposed model on unseen data.
To summarise the above results, the proposed model was trained on the FER-2013, CK+, and KDEF datasets individually, and its performance was validated on each. Furthermore, each trained model was cross-validated on the other two datasets to fully assess the generalisation ability of the proposed model.
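The evaluation protocol summarised above, in which a model is trained on one dataset and tested on each of the other two, can be sketched as a simple loop. The sketch assumes that training and evaluation are supplied as callables; `train_fn`, `eval_fn`, and the dictionary layout are hypothetical placeholders rather than the code used for the reported experiments.

```python
from itertools import permutations


def cross_dataset_evaluation(datasets, train_fn, eval_fn):
    """Train one model per dataset and evaluate it on every other dataset.

    datasets: dict mapping a dataset name to {"train": ..., "test": ...}
    train_fn: callable(training_data) -> model
    eval_fn:  callable(model, test_data) -> accuracy
    Returns a dict keyed by (source_name, target_name).
    """
    # One model per source dataset, trained only on that dataset.
    models = {name: train_fn(data["train"]) for name, data in datasets.items()}
    results = {}
    for source, target in permutations(datasets, 2):
        # The target test set is never seen during training on the source.
        results[(source, target)] = eval_fn(models[source], datasets[target]["test"])
    return results
```

With three datasets, the loop produces six source and target pairs, which corresponds to the six cross-validation tables (Tables 3 to 8).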
Based on the above results, the proposed model achieved high accuracy on each dataset; however, the model trained on CK+ achieved lower accuracy when validated on FER-2013, and the model trained on FER-2013 achieved lower accuracy when validated on KDEF. In the first case, the lower accuracy is due to the unbalanced samples in FER-2013, whereas the model was trained on the balanced CK+ dataset. In the second case, the main reason for the lower performance is that KDEF is an RGB dataset, while the FER-2013 samples are in grayscale.
  4.2. Comparative Analysis of the Proposed Model with State-of-the-Art Techniques
We conducted several experiments to compare the performance of the proposed FER model with other state-of-the-art methods, as shown in Table 9. To evaluate the model’s robustness, we first compared performance on FER-2013, where the proposed model achieved 23.2%, 17.2%, 0.6%, and 2.42% higher accuracy than the models of Arriaga et al. [51], J. Li et al. [52], Subramanian et al. [53], and Borgalli et al. [46], respectively. We also assessed the robustness of our model using the CK+ dataset, where the proposed model achieved 1.48%, 9.98%, and 3.29% higher accuracy than those of Hasani et al. [54], Borgalli et al. [46], and Bodapati et al. [6], respectively. We then assessed the proposed model using the KDEF dataset, where it obtained 0.65% higher accuracy than the method proposed by Sajjad et al. [30]. Furthermore, the methods of Haq et al. [55] and Liu et al. [56] achieved 0.39% and 7.14% lower accuracy than the proposed model, respectively.
  4.3. Time Complexity of the Proposed Model over GPU, CPU, and Resource-Constrained Devices
We evaluated the proposed model in real time to measure its processing speed on a GPU, a CPU, and a resource-constrained device (Jetson Nano). Jetson Nano is a small and powerful computer that can run multiple CNNs in parallel for different applications, such as recognition, segmentation, object detection, and speech processing. Its GPU has 128 NVIDIA CUDA® cores, its CPU is a quad-core ARM Cortex-A57, and it has 4 GB of memory. The proposed model ran at 45, 21, and 26 frames per second (fps) on the GPU, CPU, and Jetson Nano, respectively. These frame rates indicate that the model is fast enough for deployment in real-world scenarios.
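Frame rates of this kind can be measured by timing the per-frame processing function over a batch of frames. The following is a minimal sketch under the assumption that detection and classification are wrapped in a single callable; `process_frame` and `measure_fps` are hypothetical names, and the sketch is not the benchmarking code used for the reported figures.

```python
import time


def measure_fps(process_frame, frames, warmup=1):
    """Average frames per second achieved by process_frame over frames.

    The first `warmup` frames are processed once beforehand and not timed,
    so that one-off initialisation costs do not distort the result.
    """
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

The per-frame latency is the reciprocal of the frame rate, so 45, 21, and 26 fps correspond to roughly 22 ms, 48 ms, and 38 ms per frame, respectively.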