1. Introduction
Facial expressions are a fundamental aspect of human communication that convey emotions, intentions, and reactions. In educational environments, the ability to accurately perceive and respond to students’ emotional states can significantly impact the students’ learning experience. Traditional facial expression recognition systems have limitations in handling complex expressions and varying lighting conditions. Research on these environments has received widespread attention in multiple disciplines, including computer science, psychology, architecture, and education [
1]. Usually, the learning flow in these environments depends on the learner’s mental responses based on solving tests and answering exam questions, enabling the next level of the learning process to be reached. However, conventional methods do not consider the emotional behavior of the learner during the learning process. Emotional behavior is an important factor in the quality of the learning process and the success of the desired learning outcomes. A dashboard can support emotion detection during online learning [
2]. The learner’s interactions with the learning environment can be classified into two categories. The first consists of mental responses to test questions, and the second consists of emotional responses through facial expressions. Therefore, integrating both types of responses is critical for developing more adaptive and intelligent computer-based learning environments [
3]. Due to the importance of considering the learner’s emotions during the learning process, other research efforts have focused on modelling these emotions by interpreting facial expressions using machine learning algorithms [
4, 5, 6, 7, 8, 9]. Furthermore, researchers [10, 11] have tried to create models for studying the human mind, including analyzing behaviors and traits such as what kind of person someone is or their likes and dislikes. In terms of applications, experimenting on an FER dataset that can be adopted in the education sector makes it possible to assess student satisfaction with the quality of learning when students learn online using tools such as Zoom and Google Meet. Facial expression recognition (FER) is therefore a significant area of research with applications in the education field. Understanding students’ emotional states and engagement levels can greatly enhance the learning process.
Research has explored the application of facial expression recognition to applied techniques such as those based on machine learning and deep learning [
12]. Deep learning performance mostly depends on the image resolution [
13,
14,
15], and preprocessing data during feature extraction and feature selection are very important for FER classification [
16]. Even when a person is portraying a neutral expression, their internal feelings or hidden messages can be reflected on their face, and observers can perceive those feelings. The six most universal emotions are sadness, fear, happiness, disgust, surprise, and anger [
17]. However, facial expressions and the emotional interpretations of those expressions are not the same thing. The emotional interpretations of expressions are inferred by a person’s perception of the internal state of emotion.
The current research is still examining the complex dataset FER2013 [
18], which poses challenges for both single machine learning models and deep learning approaches. Existing efforts encounter accuracy bottlenecks on FER2013 for the following reasons:
- (1)
The existing approaches concentrate on feature processing within deep learning to handle between five and seven class labels effectively.
- (2)
The existing approaches rely on a single deep learning method, which is not well suited to resolving imbalanced data issues; these issues are particularly evident for minority classes among the seven classes, such as the disgust class. The FER2013 dataset has many class labels, and these serve as benchmark labels for challenging FER datasets.
The proposed model uses ensemble methods, which are advantageous for enhancing the performance of machine learning models. These algorithms aim to produce better final classification results by combining the outputs of several machine learning methods. Reviewing ensemble methods for FER datasets highlights methodological differences and reveals the reported performance of combinations of heterogeneous or homogeneous methods. The results are prioritized to compare the methods. Building on deep learning algorithms for image processing, we propose an ensemble learning method called homogeneous deep learning. Homogeneous deep learning entails the use of similar kinds of CNNs, for which we chose ensembles of six and seven models from DCNN, EfficientNetB2, InceptionResNetV2, ResNet50, Xception, DenseNet, and VGG16. The contributions of this paper are as follows:
- (1)
The homogeneous CNN ensemble combination that produces the best model is identified and found to perform better than a single deep learning method, and the number of ensemble members applied to the FER2013 dataset is determined so that the model can be transferred to online learning applications.
- (2)
The homogeneous CNN ensemble techniques handle the imbalanced data and yield better performance on minority and ambiguous classes across the seven class labels: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral.
Therefore, this paper proposes a novel approach to addressing these challenges in imbalanced datasets by applying homogeneous ensemble convolutional neural networks (HoE-CNNs), which are deep learning methods. Deep learning algorithms are promising in artificial neural network research because of their high performance on many tasks, especially image processing, video classification, and speech recognition. In this paper, the deep learning methods are applied to a single dataset, and their parameters are fine-tuned to fit the model to the domain dataset.
The structure of the paper is as follows:
Section 2 reviews the literature on facial expression recognition in education and on deep learning and ensemble learning methods.
Section 3 describes the proposed homogeneous ensemble convolutional neural network.
Section 4 describes the dataset, experimental setup, experimental results and discussion. Finally,
Section 5 concludes the implementation and describes future work.
2. Literature Review
The education field has been dramatically transformed by the incorporation of online learning and machine learning. Research integrating online learning and machine learning techniques for monitoring the emotions of students has received significant attention in the educational technology field. The following is a review of notable research and studies in this domain. In [
19] the use of machine learning for detecting and monitoring the emotional states of online learners was explored. They discuss the potential of affective computing techniques in understanding student engagement and providing personalized feedback. The paper investigated the use of machine learning models to predict student emotions and engagement in massive open online courses (MOOCs) [
20]. They highlight the importance of real-time emotional feedback in enhancing the learning experience. With machine learning, students’ facial expressions, speech, and interaction patterns are analyzed to infer students’ emotional states and adapt the learning content accordingly [
21]. These techniques, which recognize students’ affective states during online learning, can be used to design intelligent tutoring systems that provide tailored emotional support. However, using machine learning to monitor student emotions raises ethical considerations, including issues related to privacy, bias, and fairness in emotion recognition systems [
2]. These studies collectively underline the growing importance of integrating machine learning techniques with online learning platforms to monitor student emotions. They emphasize the potential benefits of adapting educational content and support to individual emotional states, ultimately enhancing the overall learning experience. However, they also highlight the need to consider ethical issues and privacy as this technology continues advancing.
Many types of machine learning techniques exist. As mentioned, machine learning typically fits an individual algorithm to the dataset. Single techniques are effective for binary classes, for example, yes or no. However, in the case of FER, the data are multiclass, with 7 main classes. Therefore, single algorithms are not well suited to applications in online learning environments for education. Ensemble methods have emerged as powerful techniques in the classification field, offering improved accuracy and robustness by combining multiple individual classifiers. Over the years, extensive research has explored the effectiveness of ensemble methods and their applications in various domains, as well as their advancements and challenges in classification. Moreover, ensemble methods have been evaluated on many datasets with the aim of determining the best classifier performance. Preprocessing the data before they are input into the ensemble models is also very important. Face detection (FD) is the extraction of facial landmarks in face regions using color gradients, followed by the extraction of geometric features of the actual facial emotions [
22,
23,
24]. The face detection step is an ultrafast solution that provides 6 landmarks and multi-face support; faces are processed holistically, which allows the contribution of holistic processing to the face inversion effect to be determined. The work in [
13] used important key points by holistically locating faces, noting the coordinates of each located face, and drawing holistic key points around every face. The values of the facial image are normalized to accelerate training; this step is performed differently by different algorithms. Afterwards, feature extraction is followed by feature selection, which ranks the importance of the existing features in the dataset and discards the less important ones [
16]. In this paper, we preprocess the data and create labels and features. Various techniques have been proposed to improve performance before inputting data into the classification model. The details of the preprocessing operation and the algorithms using the applied FER dataset are summarized in
Table 1.
3. Proposed Frameworks
Our research focuses on a homogeneous ensemble of CNN models, which are trained and fine-tuned to recognize facial expressions across 7 emotions. The homogeneous CNN ensemble approach combines deep learning models from the same CNN family, such as DCNN, EfficientNetB2, and InceptionResNetV2, enabling more robust and accurate emotion recognition. The architecture, as shown in
Figure 1, details the preprocessing that collects data through facial expression detection, the image processing, and the training process of the HoE-CNN, highlighting the techniques used to enhance performance.
The first necessary steps in facial detection methods are usually to preprocess the original input image, generate face candidate boxes on it, and input each candidate box into the network for feature extraction; for example, the network maps an image to a probability, the box parameters Bx, By, Bh, and Bw, and the Face and No Face classes. The following describes the two most widely used methods for generating face candidates. The first method scans the input image at different resolutions to obtain the candidate frames; this method can capture as large a position area of the face target as possible. The second method obtains the face candidate boxes by scaling the input image by a fixed scale factor.
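The candidate-generation idea above can be illustrated with a minimal sketch. It assumes a square sliding window (the text does not name the scanning scheme) and repeatedly shrinks the image by a fixed factor; the function names and the default values for `window`, `stride`, and `scale` are hypothetical, chosen only for illustration.

```python
def sliding_window_boxes(img_w, img_h, window, stride):
    """Return (x, y, w, h) boxes of a square window slid over the image."""
    boxes = []
    for y in range(0, img_h - window + 1, stride):
        for x in range(0, img_w - window + 1, stride):
            boxes.append((x, y, window, window))
    return boxes


def multiscale_candidates(img_w, img_h, window=48, stride=24, scale=0.5):
    """Generate candidate boxes while scaling the image by a fixed factor.

    The window size stays fixed; the image shrinks at each step, so later
    steps cover larger regions of the original image. Boxes are mapped back
    to original-image coordinates. All defaults are illustrative assumptions.
    """
    candidates = []
    factor = 1.0
    w, h = img_w, img_h
    while w >= window and h >= window:
        for (x, y, bw, bh) in sliding_window_boxes(w, h, window, stride):
            candidates.append((round(x / factor), round(y / factor),
                               round(bw / factor), round(bh / factor)))
        factor *= scale
        w, h = int(img_w * factor), int(img_h * factor)
    return candidates


boxes = multiscale_candidates(96, 96)
print(len(boxes), boxes[-1])
```

Each candidate box would then be cropped and passed to the network for the Face / No Face decision; that step is outside the scope of this sketch.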
Face candidate frames can also be generated by selective search. This method primarily uses inherent attributes of the input object to determine the possible positions of face candidate frames in the input image, which greatly reduces the number of candidate frames and the computing time of the network. In addition, multiresolution methods can be used to obtain regions of different scales, and the CNN model can then classify each region to yield its category and confidence. The image processing then includes cropping the facial emotion area, transforming the image to greyscale, normalizing the pixel values into [0, 1], and augmenting the data. Data augmentation increases the quantity of training data, using only the training set, to yield more accurate predictions and a more robust model and to avoid overfitting. It is performed during training and includes rescaling, horizontal flips, random zoom, feature-wise transforms, rotation, and horizontal and vertical shifts, as shown in
Figure 2.
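The image-processing steps described above (greyscale conversion, normalization into [0, 1], flips, and shifts) can be sketched in plain Python as follows. This is a minimal illustration on nested lists, not the pipeline used in the paper; the luma weights, the flip probability, and the shift range are assumptions, and rotation and zoom are omitted for brevity.

```python
import random


def to_greyscale(rgb_image):
    """Convert rows of (r, g, b) tuples to greyscale (common luma weights, assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def normalize(image, max_value=255.0):
    """Rescale 8-bit pixel values into the range [0, 1]."""
    return [[p / max_value for p in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift(image, dx, dy, fill=0.0):
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def augment(image, rng, max_shift=1):
    """Apply a random flip and a random shift (illustrative parameters)."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift(image, dx, dy)


face = normalize([[0, 128, 255], [64, 32, 16], [255, 0, 128]])
augmented = augment(face, random.Random(0))
```

Augmentation is applied only to training images, so that validation and test images remain unmodified.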
Convolutional neural networks (CNNs) have revolutionized the computer vision field. These networks are extremely effective for a variety of image and video recognition tasks due to their ability to automatically learn and extract key features from videos [29] or images [30, 31]. CNNs have given rise to many other deep learning models, indicating the potential of deep convolutional neural networks. The CNN model contains a convolution layer, as shown in
Figure 3. The convolution layer is the initial step in extracting various features from the input images. Convolution is a mathematical operation performed at this layer between the input image and a filter of a particular size M×M. Sliding the filter over the input image yields the dot product between the filter’s elements and the part of the input image covered by the filter (M×M). The output, known as the feature map, provides information about the image, including its corners and edges. The feature map is then passed to subsequent layers, which learn additional features of the input image. The second layer is the pooling layer, whose primary objective is to reduce the size of the convolved feature map to decrease computational costs. This is achieved by reducing the connections between layers and operating on each feature map independently. Depending on the technique used, several types of pooling are available. Max pooling takes the largest element of each feature-map region. Average pooling computes the average of the elements in a region of predetermined size. Sum pooling computes the total of the elements in the region. Typically, the pooling layer acts as a link between the convolutional layer and the fully connected (FC) layer. Finally, the FC layer comprises weights and biases and connects the neurons of two layers. FC layers are the last few layers in CNN designs and are placed before the output layer. Activation functions are one of the most essential components of CNN models. Each model in the proposed homogeneous ensemble convolutional neural network is a CNN of this kind. This paper comprises 6- and 7-ensemble CNN models, and the optimization parameter settings for each CNN model are listed in
Table 2.
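To make the convolution and pooling operations concrete, the following minimal sketch implements the sliding dot product with an M×M filter and the three pooling variants on nested lists. It assumes stride 1 and no padding for the convolution and non-overlapping windows for pooling; these choices, and the function names, are illustrative assumptions rather than the settings of the paper's models.

```python
def conv2d(image, kernel):
    """Slide an MxM kernel over the image (stride 1, no padding).

    Each output value is the dot product of the kernel with the image
    patch it covers, as is conventional in CNN layers.
    """
    m = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - m + 1):
        row = []
        for j in range(w - m + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(m) for b in range(m)))
        out.append(row)
    return out


def pool2d(feature_map, size, mode="max"):
    """Reduce a feature map with non-overlapping size x size windows."""
    reducers = {
        "max": max,                          # largest element in the window
        "avg": lambda v: sum(v) / len(v),    # mean of the window
        "sum": sum,                          # total of the window
    }
    reduce_window = reducers[mode]
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_window(window))
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]          # a simple diagonal-difference filter
feature_map = conv2d(image, kernel)
pooled = pool2d(feature_map, 2, mode="max")
```

A 3×3 image convolved with a 2×2 filter gives a 2×2 feature map, and 2×2 pooling reduces that map to a single value, which shows how pooling lowers the computational cost of later layers.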
The ensemble model combines the class-label decisions of the member CNN models. The output p_i is the probability vector of CNN_i. Given a sample x, the class with the highest probability is taken for each model, and the majority vote over the n-ensemble model gives the prediction, following Equation (1).
where
n is the number of models in the n-ensemble model,
x is the class label x,
p_CNNi is the probability vector of the i-th CNN.
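One plausible reading of this voting rule can be sketched as follows. Each CNN contributes its argmax class as a vote, and the class with the most votes is predicted. The tie-breaking rule (highest summed probability among tied classes) and the soft-voting variant are assumptions added for illustration; they are not taken from the paper, and the probability vectors below are invented example values.

```python
from collections import Counter

LABELS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])


def majority_vote(prob_vectors):
    """Hard voting: each model votes for its most probable class.

    Ties are broken by the highest summed probability among the tied
    classes (an assumption; the source does not specify tie handling).
    """
    votes = Counter(argmax(p) for p in prob_vectors)
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda c: sum(p[c] for p in prob_vectors))


def soft_vote(prob_vectors):
    """Soft voting variant: argmax of the mean probability vector."""
    n = len(prob_vectors)
    mean = [sum(p[c] for p in prob_vectors) / n
            for c in range(len(prob_vectors[0]))]
    return argmax(mean)


# Hypothetical outputs of three member CNNs for one face image.
p1 = [0.10, 0.00, 0.10, 0.60, 0.10, 0.05, 0.05]
p2 = [0.05, 0.00, 0.05, 0.50, 0.10, 0.10, 0.20]
p3 = [0.10, 0.00, 0.10, 0.20, 0.10, 0.00, 0.50]
print(LABELS[majority_vote([p1, p2, p3])])
```

With an odd number of members, as in the 7-ensemble, two-way ties are less frequent than with an even number, although ties among three or more classes remain possible.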
To transfer the created model to the online learning environment, the input process is considered: a video is a sequence of images called frames. Experiments have been conducted on various video sequences with multiple faces occurring at different sizes, faces disappearing from the sequence or becoming occluded, and faces changing poses. The input video sequence is preprocessed by detecting and tracking the face. A probability value is computed for different scales and for two different views. The frame rate of the sequences was 30 frames per second. The captured frames are passed to the proposed model, which is loaded from the saved model and applied to real-time video from a camera, and FER is achieved by classifying emotions during online learning.
To evaluate our model, we used the common precision, recall, F1-score, and accuracy metrics, which are built on four combinations of predicted and actual values in multiclass classification: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). For a class n, a true positive (TP) is a sample of actual class n that is predicted as class n. A false positive (FP) is a sample of another actual class that is predicted as class n. A false negative (FN) is a sample of actual class n that is predicted as another class. Finally, a true negative (TN) is a sample of another actual class that is predicted as another class. These metrics were evaluated using the following equations:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy should be as high as possible; it is the number of correct predictions, both positive and negative, divided by the total number of actual and predicted values.
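As a worked illustration of these metrics, the sketch below derives TP, FP, FN, and TN for one class from a multiclass confusion matrix and computes precision, recall, F1-score, and overall accuracy. It assumes the convention that rows are actual classes and columns are predicted classes; the small 3-class matrix is invented example data, not a result from the paper.

```python
def per_class_counts(cm, k):
    """Return (TP, FP, FN, TN) for class k.

    cm[i][j] is the number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                    # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp     # another class, predicted as k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def precision_recall_f1(cm, k):
    """Precision, recall, and F1-score for class k (0.0 when undefined)."""
    tp, fp, fn, _ = per_class_counts(cm, k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(cm):
    """Overall multiclass accuracy: correct predictions over all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Invented 3-class example (rows: actual, columns: predicted).
cm = [[5, 1, 0],
      [2, 3, 1],
      [0, 1, 7]]
print(per_class_counts(cm, 0), accuracy(cm))
```

In the multiclass case, the overall accuracy is usually computed from the diagonal of the confusion matrix as above, while precision, recall, and F1-score are reported per class or averaged over classes.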
4. Experimental Results
All the experiments were run on a Linux operating system with an 11th Gen Intel Core i9-11900 @ 2.50 GHz CPU, CORSAIR VENGEANCE LPX 32 GB (16 GB × 2) DDR4 3200 MHz RAM, and an NVIDIA GeForce 1060 6 GB GPU. All our experiments use the FER2013 dataset, which is a benchmark FER dataset consisting of 35,887 48 × 48 pixel 8-bit greyscale images of various people’s facial expressions, labelled by facial emotion, as shown in
Figure 4.
The data points were split into 3 datasets: 28,709 training data points, 3589 validation data points, and 3589 test data points, as shown in
Table 3. The facial images encompass both male and female subjects and 7 emotion classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral), as shown in
Figure 5. Due to varying levels of exposure, illumination, and occlusion, manual annotation is only approximately 65% accurate on this dataset, and the class imbalance makes the dataset challenging, as shown in
Figure 5 and
Figure 6.
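The split sizes reported above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text and derives the share of each split; the variable names are illustrative.

```python
# Split sizes as reported in the text.
splits = {"train": 28709, "validation": 3589, "test": 3589}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]} images ({share:.1%})")
print("total:", total)
```

The three splits sum to the 35,887 images of FER2013 and correspond to roughly an 80/10/10 division.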
As the amount of data increases, the data remain imbalanced across the 7 classes. The most common class of the FER dataset is happy, followed by neutral, sad, fear, angry, surprise and disgust. In particular, the disgust class is more difficult than the others. The hyperparameters of the CNNs optimized for FER2013 are listed in
Table 4. Moreover, the different hyperparameters of each CNN model are fine-tuned. The proposed method was evaluated on FER2013.
The accuracies (%) of the models on the FER dataset are compared in
Table 4. The accuracy of the 7-ensemble CNN is compared with those of other models in [26, 27, 28], which report accuracies of 71.20%, 72.40%, 73.39% and 73.40% in a similar setting on FER2013; these works proposed developed and fine-tuned CNN models for the same 7-class, imbalanced dataset.
Our proposed model, namely, the HoE-7 ensemble, combines Xception, DCNN, VGG16, EfficientNetB2, DenseNet, ResNet50 and InceptionResNetV2 and achieves an accuracy of 75.15%. The HoE-6 ensemble, which combines DCNN, VGG16, EfficientNetB2, DenseNet, ResNet50 and InceptionResNetV2 following Equation (1), achieves a slightly lower accuracy of 73.73%. The single CNN models Xception, DCNN, EfficientNetB2, VGG16, DenseNet, ResNet50 and InceptionResNetV2 achieve accuracies of 65.19%, 66.75%, 68.09%, 68.20%, 69.10%, 69.70% and 71.60%, respectively. The experimental results demonstrate the robustness and strong performance of the 7-ensemble CNN model.
Figure 7 depicts the confusion matrix for the final model on the FER2013 test set, which classifies samples into 7 emotions. The confusion matrix relates true labels to predicted labels with values between 0 and 1, where 1 is the highest true positive rate, reached when the true class and predicted class always agree. The analysis of model performance shows that the model most effectively categorizes the “happy” and “surprise” emotions because the numbers of training and validation points for these emotions were sufficient for our proposed model, yielding the best classification performance. On the “happy” and “surprise” emotions, the HoE-7 CNN yields robust and useful FER accuracies of 0.91 and 0.85, respectively. It achieves accuracies of 0.80 on the neutral class and 0.72 on the disgust class. On the angry, sad and fear classes, it achieves accuracies of 0.69, 0.61, and 0.57, respectively. The challenge of FER lies in classifying the multiple classes of FER2013.
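Confusion-matrix values between 0 and 1 are typically obtained by dividing each row of the count matrix by its row total, so that the diagonal holds the fraction of each true class that is classified correctly. The sketch below assumes this row-wise normalization (the text does not state which normalization Figure 7 uses) and runs on an invented 2-class count matrix.

```python
def normalize_rows(cm):
    """Divide each row of a count matrix by its row sum.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Rows with no samples are returned as all zeros.
    """
    normalized = []
    for row in cm:
        row_total = sum(row)
        normalized.append([v / row_total if row_total else 0.0 for v in row])
    return normalized


def per_class_rate(cm):
    """Diagonal of the row-normalized matrix: correct fraction per true class."""
    normalized = normalize_rows(cm)
    return [normalized[i][i] for i in range(len(cm))]


# Invented 2-class example (rows: true labels, columns: predictions).
counts = [[8, 2],
          [1, 3]]
print(normalize_rows(counts))
print(per_class_rate(counts))
```

Under this normalization, each diagonal entry equals the recall of the corresponding class, which is why minority classes can be compared with majority classes on the same 0 to 1 scale.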
5. Discussion
The proposed methodology yields comprehensive experimental results that demonstrate the superiority of the HoE-CNN approach over traditional methods. The true positive performance for each class label and the distribution of predictions show the robustness of the ensemble methodology, as shown in
Figure 8. This evaluation uses benchmark datasets commonly used in FER research, showcasing the model’s ability to handle various expressions, poses, and lighting conditions. We also discuss the implications of our findings for educational applications. Ensemble methods often sacrifice interpretability for improved accuracy. Understanding the decision-making process of an ensemble method can be challenging due to the complexity introduced by multiple classifiers. Research efforts are needed to develop techniques for providing insights into ensemble decisions and making the models more interpretable. As the size of datasets continues increasing, scalability becomes a major concern for ensemble methods. Training and combining many classifiers can be computationally expensive and time-consuming. Researchers are exploring parallel and distributed computing techniques for addressing the scalability issues associated with ensemble methods. The selection and combination of diverse classifiers are crucial factors in the performances of ensemble methods. Ongoing research focuses on optimizing ensemble diversity through innovative ensemble generation algorithms, ensemble pruning techniques, or ensemble selection strategies. Developing efficient and effective algorithms to automatically determine the optimal diversity in ensembles remains an active research area. Ensemble methods are susceptible to noisy or corrupted data, which can adversely affect the ensemble performance. Research for developing robust ensemble techniques that are resilient to noise and outliers in data is ongoing. Approaches such as robust aggregation techniques and noise detection and removal methods are being explored to enhance the resilience of ensemble methods against data imperfections.
In terms of online learning, this work integrates facial expression recognition (FER) technology into a model that offers a transformative approach to education by fostering a more interactive and personalized learning environment. By analyzing facial expressions in real time, this combination provides educators with valuable insights into students’ engagement levels, emotional responses, and comprehension, as shown in
Figure 9.
The applied models enable instructors to adapt their teaching methods dynamically, offering personalized assistance and tailored resources to students based on their emotional cues. Additionally, FER facilitates the creation of a more empathetic and supportive learning atmosphere, allowing educators to provide timely assistance and guidance to students in need. Ultimately, this integration not only enhances the educational experience by promoting active engagement but also enables educators to refine their teaching strategies for better learning outcomes. However, the approach still has limitations: a student’s face may not be detected, and students’ cameras may be limited.
6. Conclusions
In modern e-learning environments, understanding students’ emotional states is crucial for personalized education. In this paper, we explore the application of a homogeneous ensemble CNN method to FER, combining multiple CNN models to improve emotion recognition accuracy. By enhancing our ability to detect students’ emotional responses, we aim to adapt educational content, provide real-time support, and create more engaging learning environments. The research findings have significant implications for the education field. The improved FER accuracy can be incorporated into educational technology to form more responsive and personalized learning environments. Teachers and educators can utilize this technology to gauge students’ emotional states and adapt their teaching methods, thus fostering more empathetic and effective educational experiences. In this paper, we design a model for facial expression recognition and detail the proposed model using convolutional networks. Additionally, we overview the proposed method through three processes. The first process is facial detection. The next process is image processing. The last process is the FER process and its transfer to online learning applications. In this paper, we study FER on the FER2013 dataset, which encompasses 7 emotions for training, validation and testing: the “Sad”, “Fear”, “Disgust”, “Neutral”, “Happy”, “Angry” and “Surprise” classes. The HoE-CNN approach offers improved accuracy and robustness by harnessing the power of multiple classifiers. Advancements in ensemble techniques, such as performance improvements, imbalanced data responses, diversity measures, and strategy combinations, contribute significantly to the effectiveness of the proposed models. However, challenges related to interpretability, scalability, diversity optimization, and noisy data persist.
Continued research efforts are crucial to address these challenges and further enhance the capabilities of ensemble methods in classification. By overcoming these challenges, ensemble methods can continue to significantly contribute to various education domains.