Article

Visual Geometry Group-SwishNet-Based Asymmetric Facial Emotion Recognition for Multi-Face Engagement Detection in Online Learning Environments

Faculty of Education, Liaoning Normal University, Dalian 116029, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(5), 711; https://doi.org/10.3390/sym17050711
Submission received: 17 February 2025 / Revised: 30 April 2025 / Accepted: 5 May 2025 / Published: 7 May 2025
(This article belongs to the Section Computer)

Abstract

In the contemporary global educational environment, the automatic assessment of students’ online engagement has garnered widespread attention. A substantial number of studies have demonstrated that facial expressions are a crucial indicator for measuring engagement. However, due to the asymmetry inherent in facial expressions and the varying degrees of deviation of students’ faces from a camera, significant challenges have been posed to accurate emotion recognition in the online learning environment. To address these challenges, this work proposes a novel VGG-SwishNet model, which is based on the VGG-16 model and aims to enhance the recognition ability of asymmetric facial expressions, thereby improving the reliability of student engagement assessment in online education. The Swish activation function is introduced into the model due to its smoothness and self-gating mechanism. Its smoothness aids in stabilizing gradient updates during backpropagation and facilitates better handling of minor variations in input data. This enables the model to more effectively capture subtle differences and asymmetric variations in facial expressions. Additionally, the self-gating mechanism allows the function to automatically adjust its degree of nonlinearity. This helps the model to learn more effective asymmetric feature representations and mitigates the vanishing gradient problem to some extent. Subsequently, this model was applied to the assessment of engagement and provided a visualization of the results. In terms of performance, the proposed method achieved high recognition accuracy on the JAFFE, KDEF, and CK+ datasets. Specifically, under 80–20% and 10-fold cross-validation (CV) scenarios, the recognition accuracy exceeded 95%. According to the obtained results, the proposed approach demonstrates higher accuracy and robust stability.

1. Introduction

Student engagement within online learning is an issue that teachers and parents have been paying close attention to [1,2]. Students who actively engage in learning activities are more likely to gain profound learning experiences and better academic achievements [3,4,5,6]. However, due to the separation of time and space between teachers and students during the online learning process, it is rather difficult for teachers to promptly perceive changes in students’ engagement. Meanwhile, students are highly susceptible to the interference of the external learning environment during online learning, which poses a challenge to their ability to maintain attention for a long period. As an important external manifestation of engagement, facial expressions can intuitively reflect students’ learning status and interest [7]. Through the precise identification and analysis of these facial cues, it is possible to deduce the engagement levels of students.
Human facial emotional expressions are achieved through the synergistic actions of complex facial action units (AUs). For instance, AU1 represents the raising of eyebrows, AU6 represents the constriction of eyes, and AU12 represents the stretching of the corners of the mouth. Combinations of these AUs constitute a rich variety of facial expressions. However, when a person only raises one side of the mouth when expressing happiness or only lifts one eyebrow when expressing surprise, such asymmetry is a natural characteristic of emotional expression. Therefore, recognizing these asymmetric expressions is of great significance for understanding students’ genuine emotional states. Moreover, as students participate in online learning for relatively long periods, they will inevitably occasionally deviate from the field of view of the camera, revealing the side or partial contour of their faces. Such deviations can also lead to asymmetry in the facial presentation within the images, further intensifying the complexity of facial emotion recognition (FER) and posing a tremendous challenge to the precise identification of facial emotions.
In recent years, deep learning technologies have made significant progress in tasks such as feature extraction, classification, and recognition. In the field of computer vision, deep learning has been widely applied to tasks like object recognition, face detection, and emotion detection. With the rapid development of artificial intelligence (AI) and the widespread application of big data, the role of facial expression recognition in the teaching and learning process has become increasingly prominent, attracting the attention of more and more researchers [8,9]. Deep learning-based facial expression recognition technology enables “end-to-end” learning, directly extracting features from input images, which reduces human intervention and lowers the complexity of feature selection and extraction. Relying on this technology, teachers can more accurately understand the learning emotions of each student, adjust teaching strategies in a timely manner, and enhance student engagement and learning outcomes [10,11]. Therefore, developing an automatic facial expression recognition system for students is of great significance for improving the quality of education and the learning experience.
Meanwhile, the You Only Look Once (YOLO) series of models, an efficient real-time object detection framework, has been widely applied in fields such as autonomous driving and surveillance. Introduced by Redmon et al. [12] in 2016, YOLO treats object detection as a single regression problem, directly predicting bounding boxes and class probabilities from images. This design enables YOLO to perform exceptionally well in both training and inference, allowing for the rapid identification of multiple objects within an image. However, YOLO still faces challenges in detecting small objects, dealing with complex backgrounds, and handling occlusions [13]. For instance, poorly designed anchor boxes may lead to missed detections of small objects, noise in complex backgrounds can interfere with object detection, and partially occluded objects may be misdetected or missed due to reliance on anchor boxes and non-maximum suppression (NMS). Moreover, to maintain real-time performance, YOLO may sacrifice adaptability to new data distributions, thereby reducing detection accuracy and robustness [14].
Based on these considerations, a novel VGG-SwishNet deep learning model is proposed, which can automatically detect seven types of facial emotions using the facial images in the dataset, and it is particularly proficient in identifying asymmetric emotions caused by facial expressions themselves and postures. The proposed model adopts the pre-trained visual geometry group 16 (VGG-16) architecture, combines it with transfer learning (TL) techniques, and introduces the Swish activation function and batch normalization (BN) layers into the model. TL strategies not only leverage the robust feature extraction capabilities of the VGG-16 model but also empower it to learn meaningful feature representations from new datasets. BN layers reduce internal covariate shifts, ensuring the stability of data distribution, which facilitates effective gradient propagation in subsequent layers. Furthermore, it accelerates the model’s training process, enhancing the convergence speed and overall stability of the model [15].
The Swish activation function was proposed in 2017 and has drawn widespread attention in the field of deep learning [16]. Compared to traditional activation functions, it allows for smoother nonlinear mapping and can automatically adjust its degree of nonlinearity through a self-gating mechanism, thereby helping to mitigate the vanishing gradient problem. When dealing with the asymmetry caused by facial postures, its nonlinear characteristics can assist the model in more effectively extracting complex asymmetric features under different postures, such as the changes in facial contours and facial features when the face is in profile. Regarding the asymmetry of facial expressions themselves, the smoothness of the Swish activation function enables the model to more accurately capture subtle differences in expressions, such as a raised corner of the mouth or a drooping eyelid. Therefore, the Swish activation function has a significant advantage in recognizing asymmetric facial expressions. In addition, dropout layers with a rate of 0.25 are incorporated to prevent model overfitting. Finally, the performance of the proposed model was evaluated on benchmark datasets and compared with the engagement detection and real-time performance of five different versions of the YOLOv9 model (YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, and YOLOv9e) and the ViT-Base/16 model in online learning environments.
The general contributions of the paper can be outlined as follows:
(i)
A novel and effective VGG-SwishNet model is proposed, which introduces the Swish activation function and BN layers based on the original VGG-16 model to improve the recognition capabilities for asymmetric facial emotions.
(ii)
The proposed VGG-SwishNet model is applied to multi-face engagement detection in online learning environments, with visualization results provided. Additionally, the accuracy of the proposed model in engagement detection is validated through qualitative research methods involving semi-structured interviews.
The rest of this paper is organized as follows: Section 2 presents a brief review of existing FER and engagement detection methods. Section 3 provides a preliminary overview of the convolutional neural network (CNN), the VGG-16 model, TL, and Swish activation function; explains the architecture for engagement detection based on FER as well as the proposed VGG-SwishNet model; and describes the benchmark datasets and performance metrics used in the experiments. Section 4 offers a detailed explanation of image preprocessing and experimental setup and provides a thorough discussion and comparative analysis of performance evaluations on benchmark datasets, as well as the results and related issues concerning online engagement detection. Section 5 concludes the paper with a discussion of potential directions for future research.

2. Related Works

2.1. Facial Expression Recognition

Facial expression recognition has experienced remarkable advancements over the past few decades, with significant strides in incorporating machine learning and deep learning methodologies. Early facial expression recognition research mainly relied on traditional machine learning algorithms, such as support vector machines (SVMs), k-nearest neighbors (KNNs), random forests (RFs), and neural networks (NNs), often in conjunction with a suite of feature extraction techniques to bolster recognition accuracy. Common feature extraction methods include the Gabor filter, Haar wavelet, local binary pattern (LBP), local ternary pattern (LTP), histogram of oriented gradient (HOG), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), principal component analysis (PCA), etc. For instance, Lee et al. [17] introduced an algorithm predicated on regularized discriminant analysis (RDA) for FER, harnessing the combined strengths of LDA and QDA. Holder and Tapamo [18] enhanced the traditional gradient local ternary pattern (GLTP) with the incorporation of a more precise Scharr gradient operator and PCA for dimensionality reduction, subsequently validating their approach using an SVM on the extended Cohn-Kanade (CK+) and JAFFE datasets. Bellamkonda and Gopalan [19] proposed a facial expression recognition model based on SVM, which uses Gabor wavelet and LBP for facial feature extraction and PCA for dimensionality reduction. Eng et al. [20] demonstrated the efficacy of HOG for feature extraction from facial images, employing an SVM classifier for expression categorization, and substantiated their method’s validity on the JAFFE and KDEF datasets.
With the breakthrough of deep learning in image processing, methods based on deep learning have begun to dominate facial expression recognition. Mollahosseini et al. [21] discussed the application of deep neural networks in facial expression recognition and proposed a deep learning framework based on GoogLeNet and AlexNet to recognize facial expressions. The model was verified on seven different datasets. Jain et al. [22] adopted a deep neural network model that integrates CNNs for feature extraction with recurrent neural networks (RNNs) for emotion classification. Sari et al. [23] devised a lightweight CNN architecture for facial expression recognition and tested it on the CK+, JAFFE, and KDEF datasets. The rapid development of deep belief networks (DBNs) has also attracted considerable research interest. Liu et al. [24] constructed an innovative boosted deep belief network (BDBN) framework for recognizing facial expressions, which amalgamates the capabilities of strong and weak classifiers to enhance emotion recognition. Zhao et al. [25] capitalized on the unsupervised feature learning attributes of DBNs and the strengths of multi-layer perceptrons (MLPs), achieving a deep learning-based approach to facial expression recognition that exhibited commendable performance on the JAFFE and CK datasets. Roy et al. [26] introduced ResEmoteNet, a novel deep learning architecture for FER that integrates CNNs, Squeeze-and-Excitation (SE) blocks, and residual networks. The inclusion of an SE block focuses on the significant features of the human face, enhancing feature representation and suppressing less relevant ones, thereby reducing loss and improving overall model performance.
In recent years, the introduction of the YOLO model has brought new opportunities for the development of facial expression recognition technology. Bharathi et al. [27] proposed a novel approach utilizing the YOLO algorithm for the real-time detection of individuals in video frames, followed by a shallow CNN model for facial expression classification. The model achieved a notable accuracy of 95.57% on the FER-2013 dataset, demonstrating its effectiveness in recognizing seven distinct facial expressions. Vanamoju et al. [28] proposed a deep learning classifier based on YOLO for FER. This study leveraged the single-shot detection capability of the YOLOv8 algorithm, using a single neural network to predict bounding boxes and emotion class probabilities, achieving a test accuracy of 72.56% on the FER-2013 dataset. Parambil et al. [29] conducted a comprehensive comparative study of the YOLO model series, evaluating the performance of YOLOv5 to YOLOv9 in emotion recognition tasks. Based on the AffectNet dataset, the study compared the strengths and weaknesses of different YOLO models in terms of accuracy, inference time, model size, and generalization ability. It was found that YOLOv9e achieved the best accuracy, while YOLOv8n struck a good balance between speed and accuracy. Additionally, Hasan et al. [30] proposed a structural model combining YOLO face detection and CNN to predict human facial emotions. This model detected faces using YOLO and extracted features with CNN to classify seven basic emotions. The experimental accuracy reached 94% on the FER-2013 dataset.

2.2. Engagement Detection

In engagement detection, Hasnine et al. [31] crafted an educational application that computes engagement levels and categorizes student engagement as “highly engaged”, “engaged”, and “disengaged”. Ayouni et al. [32] predicted student engagement levels based on students’ activities recorded in Blackboard reports and compared three different machine learning predictive models. Sharma et al. [33] used the Haar cascade algorithm for face detection and eye region extraction. A CNN is used to identify whether students face the camera and classify their attention status as “distracted” or “focused”. If a student’s attention is defined as “focused”, the facial emotion of the student is subsequently recognized, and the student’s engagement is classified as “very involved”, “nominally involved”, or “not involved”. Shen et al. [7] designed a CNN model based on domain adaptation for facial expression recognition to assess the learning engagement of online learners in the MOOC scenario, and the engagement of learners was classified as “great”, “not bad”, and “not so well”.

3. Methodology

3.1. Preliminary

CNNs represent a class of deep learning architectures that have been widely applied in areas such as image recognition [34,35]. They automatically extract features, reducing the need for manual feature engineering, and enhance model generalization through structures such as convolutional, pooling, and fully connected layers. Owing to their hierarchical architecture, CNNs can efficiently capture a rich representation of features ranging from low-level edges and textures to high-level semantic information such as object parts and shapes.
VGG-16 [36] is a paradigmatic deep convolutional neural network (DCNN) architecture developed by the Visual Geometry Group of Oxford University. It extracts rich features by stacking multiple convolutional and pooling layers. Additionally, the model demonstrates excellent generalization ability and performs admirably on the ImageNet dataset [37], making its pre-trained model a powerful feature extractor applicable to various image recognition tasks. Moreover, the VGG-16 model is straightforward to implement and expand, with its relatively simple structure facilitating easy implementation and modification, allowing for convenient fine-tuning to better adapt to facial expression recognition tasks.
TL is a commonly used machine learning technique that allows models trained on other tasks to be used as a starting point [38]. By fine-tuning the existing model, the knowledge acquired from one task can be transferred to another distinct yet related task, thereby adapting to the new learning task. It is particularly effective when data resources are limited or training costs are high [39].
The Swish activation function is a self-gated activation function that can automatically adjust its degree of nonlinearity. This function approaches linearity when the input value is large, allowing for gradient flow, and smoothly transitions near zero to facilitate stable gradient propagation. It also gradually saturates for small input values, thereby alleviating the vanishing gradient problem [40]. Additionally, the non-monotonicity and smoothness of the Swish activation function help stabilize gradient updates during backpropagation, promote faster convergence to optimal solutions, and enhance the model’s generalization ability to new data.

3.2. Proposed Architecture for Facial Emotion Recognition and Engagement Detection

An architecture for assessing the level of student engagement through recognizing their facial emotions is proposed, as depicted in Figure 1. The arrows indicate the direction of the workflow, and the red rectangles are used to highlight the facial regions. The FER model is based on a pre-trained VGG-16 network, which is further fine-tuned for our purposes. Following the accurate identification of students’ facial emotions using the proposed model, the level of student engagement can be automatically assessed by integrating the emotion weights (EW) associated with each emotion.

3.2.1. Facial Emotion Recognition

In this stage, the VGG-16 model initially introduced in Section 3.1 was deployed and fine-tuned. Fine-tuning is a commonly used TL technique. The principal advantage of this method is that it can not only avoid drastically altering the beneficial feature representations within the pre-trained model but also allow the model to adjust moderately on the new dataset to meet the needs of the new task. The fine-tuned VGG-16 model is shown in Figure 2.
As shown in Figure 2, the proposed model consists of a total of 46 layers. In the original VGG-16 model, the rectified linear unit (ReLU) activation function is used uniformly as the activation function. However, numerous studies have indicated that the ReLU function can lead to the problem of “dying neurons”, which in turn affects the model’s expressiveness [41,42]. This is because the derivative of ReLU is zero when the input is less than zero. Once a neuron’s output becomes zero, its weights cannot be updated in subsequent training, causing the neuron to permanently output zero. In contrast, the Swish activation function possesses smooth properties and excellent gradient propagation characteristics, enabling more stable weight updates during training. Its non-monotonic nature allows it to better capture complex relationships among input features, thereby enabling the model to learn richer feature representations and enhance its ability to fit complex data patterns [43]. Additionally, the self-gating mechanism of the Swish activation function enables the model to adaptively adjust the activation level based on the magnitude and importance of the input, thus better handling different types of features [40]. Studies have also demonstrated that, in many challenging datasets, especially in deeper models, the Swish activation function outperforms the ReLU activation function [16]. Therefore, in the proposed model, all ReLU activation functions are replaced with Swish activation functions. The calculation formula of the Swish activation function is shown in Formulas (1) and (2).
f(x) = x \cdot \sigma(\beta x)  (1)
\sigma(x) = \frac{1}{1 + e^{-x}}  (2)
where σ(x) denotes the sigmoid function, and β is a hyperparameter used to control the degree of nonlinearity of the Swish activation function.
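To make the activation replacement concrete, a minimal PyTorch sketch is given below (for illustration only; the experiments in this paper were implemented in MATLAB). It defines the Swish function of Formulas (1) and (2) with a fixed β = 1 and swaps it in for every ReLU of an ImageNet-pretrained VGG-16, replacing the final dense layer with a 7-way emotion classifier; the BN and dropout modifications described below are omitted here.

```python
import torch
import torch.nn as nn
from torchvision import models

class Swish(nn.Module):
    """Swish activation f(x) = x * sigmoid(beta * x), Formulas (1) and (2)."""
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

def replace_relu_with_swish(module: nn.Module, beta: float = 1.0) -> None:
    """Recursively swap every ReLU activation in the network for Swish."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Swish(beta))
        else:
            replace_relu_with_swish(child, beta)

# Load an ImageNet-pretrained VGG-16, replace its activations, and adapt the output layer.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
replace_relu_with_swish(vgg)
vgg.classifier[-1] = nn.Linear(vgg.classifier[-1].in_features, 7)  # 7 emotion classes
```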
Subsequently, a BN layer was added after each max-pooling layer in the improved VGG-16 architecture. The max-pooling layer is primarily used to reduce the spatial dimensions of the feature maps while retaining important and salient information. Introducing a BN layer after the pooling layer allows for the normalization of the features after pooling, thereby stabilizing the data distribution. This is because the pooling layer may alter the numerical distribution of the feature maps, while the BN layer, by normalizing the mean and variance, effectively reduces the problem of internal covariate shift, making the learning process of subsequent layers more stable and efficient [15,44]. The calculation formulas for the BN layer are shown in Equations (3)–(6).
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i  (3)
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2  (4)
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}  (5)
y_i = \gamma \hat{x}_i + \beta  (6)
where m denotes the batch size, and x_i represents an individual sample within the batch. μ_B and σ_B² are the mean and variance of the batch, respectively, and ϵ is a small constant added for numerical stability. The parameters γ and β are trainable, serving as the scaling and translation factors, respectively.
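For clarity, the batch normalization forward pass of Equations (3)–(6) can be written as a short NumPy sketch; the shapes and example values below are illustrative only.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch, following Equations (3)-(6).

    x: array of shape (m, d), i.e., m samples per batch; gamma and beta are
    the trainable scale and shift parameters of shape (d,).
    """
    mu_b = x.mean(axis=0)                      # Eq. (3): batch mean
    var_b = x.var(axis=0)                      # Eq. (4): batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)  # Eq. (5): normalization
    return gamma * x_hat + beta                # Eq. (6): scale and shift

# Example: a batch of four 2-dimensional feature vectors.
x = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [4.0, 2.0]])
y = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
```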
Finally, the last dense layer was replaced with a novel one for classifying facial images into one of seven emotion categories: angry, disgusted, fearful, happy, neutral, sad, and surprised. Additionally, two dropout layers were set with a dropout rate of 0.25 to prevent overfitting. The output layer utilizes a softmax function and cross-entropy loss function for multi-class classification, as shown in Formulas (7) and (8).
\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}  (7)
L = -\sum_{j=1}^{K} y_j \log(\sigma(z)_j)  (8)
where K represents the total number of categories, z_j is the j-th element of the original output vector, σ(z)_j is the predicted probability of the j-th category after the softmax transformation, y_j is the j-th element of the true label vector, and L is the loss value.
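Formulas (7) and (8) translate directly into code; the NumPy sketch below uses illustrative logits for the seven emotion classes (the class ordering is an assumption).

```python
import numpy as np

def softmax(z):
    """Softmax over the K class logits, Equation (7)."""
    z = z - z.max()            # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, y_true):
    """Cross-entropy loss for a one-hot label vector, Equation (8)."""
    return -np.sum(y_true * np.log(probs + 1e-12))

# Example with K = 7 emotion classes; the true class is "happy" (index 3 here).
logits = np.array([0.2, -1.0, 0.1, 2.3, 0.0, -0.5, 0.4])
y_true = np.zeros(7)
y_true[3] = 1.0
loss = cross_entropy(softmax(logits), y_true)
```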

3.2.2. Engagement Detection

In the engagement detection stage, the Haar feature-based cascade classifier within OpenCV [45] was utilized to identify and extract facial regions. Then, the method proposed in Section 3.2.1 was used to recognize the emotion of the extracted face and output the dominant emotion and its emotional probability (DEP) corresponding to each face. Finally, the concentration index (CI) was obtained by multiplying the dominant emotion probability by the corresponding EW, and the formula is shown in Equation (9).
CI = DEP × EW
The EW is a scalar ranging from 0 to 1, mainly used to quantify the intensity of facial emotional expressions at a particular instant. In this study, the EW values defined by Sharma et al. [33] are adopted, as shown in Table 1. Based on the obtained CI, a student’s level of engagement (highly engaged, nominally engaged, or not engaged) can be determined according to Table 2.
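A possible implementation of this scoring step is sketched below. The emotion weights and CI thresholds are placeholders only, since the actual values of Table 1 (after Sharma et al. [33]) and Table 2 are not reproduced here; only Equation (9) itself is taken from the text.

```python
# Illustrative emotion weights (EW) and CI thresholds -- placeholders standing in
# for Table 1 (after Sharma et al. [33]) and Table 2, which are not reproduced here.
EMOTION_WEIGHTS = {
    "happy": 0.9, "surprised": 0.8, "neutral": 0.7,
    "sad": 0.4, "fearful": 0.3, "angry": 0.25, "disgusted": 0.2,
}

def concentration_index(dominant_emotion: str, dep: float) -> float:
    """CI = DEP x EW, Equation (9); dep is the dominant emotion probability."""
    return dep * EMOTION_WEIGHTS[dominant_emotion]

def engagement_level(ci: float) -> str:
    """Map the concentration index to an engagement tier (thresholds are placeholders)."""
    if ci >= 0.5:
        return "Highly engaged"
    if ci >= 0.2:
        return "Nominally engaged"
    return "Not engaged"

print(engagement_level(concentration_index("happy", dep=0.92)))  # -> Highly engaged
```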

3.3. Facial Expression Datasets

Three benchmark facial expression datasets were used to evaluate the proposed approach: the Japanese female facial expression dataset, the Karolinska directed emotional faces dataset, and the Extended Cohn-Kanade dataset. A brief description of the three datasets used is given below.

3.3.1. Japanese Female Facial Expression Dataset (JAFFE)

The JAFFE [46] dataset comprises a total of 213 grayscale images of 10 Japanese females, with each image measuring 256 × 256 pixels. The dataset covers the six basic facial expressions (angry, disgusted, fearful, happy, sad, and surprised) as well as the neutral expression. Each expression exhibits subtle asymmetry when presented on the face, such as differences in the upward movement range of the corners of the mouth and the movements of the eye muscles. These slight variations provide abundant samples for the study of asymmetric facial expressions. Moreover, this dataset can also be used to evaluate the performance and generalization capabilities of models when dealing with a limited amount of data. Some of the images are shown in Figure 3.

3.3.2. Karolinska Directed Emotional Faces Dataset (KDEF)

The KDEF [47] dataset consists of 4900 human facial expression images, each with a resolution of 562 × 762 pixels. It contains facial expression images of 70 individuals (35 males and 35 females). Each individual shows the six basic expressions (angry, disgusted, afraid, happy, sad, and surprised) as well as a neutral expression. These expressions are captured from five distinct perspectives: full left profile, left half-profile, front view, right half-profile, and full right profile. The multi-angle images of the KDEF dataset facilitate the model’s ability to understand and recognize facial expressions from various perspectives, which is particularly beneficial for identifying asymmetric expressions that may occur naturally. Some of the images are shown in Figure 4.

3.3.3. Extended Cohn-Kanade Dataset (CK+)

The CK+ [48] dataset is an extension of the original CK dataset, comprising 593 video sequences from 123 subjects. The participants’ ages range from 18 to 50 years, and 69% of the participants are female. In terms of ethnicity, 81% are Euro-American, 13% are Afro-American, and 6% belong to other groups. This diverse demographic composition enables the models trained on this dataset to be more effectively generalized across different populations, thereby enhancing their universality and robustness. Among the 593 video sequences, 309 sequences are annotated with one of the six basic emotions: angry, disgusted, fearful, happy, sad, and surprised. Each video sequence captures the dynamic transition from a neutral facial expression to a peak emotional expression. For the purposes of this study, the first frame of each of the 309 sequences was selected to represent the neutral expression, while the final three frames were chosen to represent the target emotion. This approach resulted in a dataset comprising 1236 images, each with a resolution of 640 × 490 pixels. Some of the images are shown in Figure 5.

3.4. Performance Metrics

Four widely utilized performance metrics, namely accuracy, precision, recall, and the F1-score, were adopted to thoroughly assess the proposed approach. These well-established metrics offer insightful evaluations of classification model performance. Accuracy indicates the proportion of correctly predicted instances out of the total. Precision measures the accuracy of positive predictions among all predicted positives. Recall, or sensitivity, reflects the model’s ability to identify all actual positives. The F1-score provides a balanced view by combining precision and recall into a single metric. The formulas used for calculating these metrics are provided as Equations (10), (11), (12), and (13), respectively.
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}  (10)
Precision = \frac{TP}{TP + FP}  (11)
Recall = \frac{TP}{TP + FN}  (12)
F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}  (13)
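These four metrics follow directly from the confusion counts of a class; the short Python sketch below implements Equations (10)–(13) with illustrative counts.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1-score from Equations (10)-(13)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Example with illustrative counts for one emotion class.
print(classification_metrics(tp=95, tn=90, fp=5, fn=10))
```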

4. Experimental Results and Discussion

4.1. Image Preprocessing

In the preprocessing steps, the images in each dataset were resized to 224 × 224 pixels. To preserve the relevant information during resizing and to address variations in illumination and head pose, OpenCV’s Haar feature-based cascade classifier was first employed for face detection in the original images, and the facial region was cropped to reduce the interference of background information. To handle head pose variations, the 68-point facial landmark detector from the dlib library was used to detect key facial features and calculate the angles of the eyes, nose, and mouth. The face was then aligned to a standard position through an affine transformation, thereby enhancing the model’s robustness to pose variations. To mitigate the impact of different illumination conditions, histogram equalization was applied to adjust the brightness and contrast of the images, which helps to reduce the influence of lighting variations on model performance. After alignment and normalization, the images were resized to 224 × 224 pixels using bicubic interpolation, which preserves image quality and retains more facial detail during resizing. Following these preprocessing steps, the images were fed into the proposed VGG-SwishNet model to extract reliable features for classifying the images into seven different emotion categories.
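A simplified version of this pipeline is sketched below using OpenCV: Haar-cascade face detection, cropping, histogram equalization, and bicubic resizing to 224 × 224 pixels. The dlib 68-point landmark alignment step described above is omitted for brevity, and the cascade path assumes a standard opencv-python installation.

```python
import cv2

# Haar cascade shipped with OpenCV (path assumes a standard opencv-python install).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_face(image_path: str, size: int = 224):
    """Detect the largest face, equalize its histogram, and resize it with bicubic interpolation."""
    img = cv2.imread(image_path)
    if img is None:
        return None
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    face = gray[y:y + h, x:x + w]                       # crop to the facial region
    face = cv2.equalizeHist(face)                       # reduce illumination differences
    return cv2.resize(face, (size, size), interpolation=cv2.INTER_CUBIC)
```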

4.2. Experimental Setup

The experiment was conducted on a 64-bit Microsoft Windows 10 PC with MATLAB R2024b software. The CPU is a 13th Gen Intel(R) Core(TM) i9-13900K (Intel Corporation, Santa Clara, CA, USA) with a clock frequency of 3.00 GHz, the graphics card is an NVIDIA GeForce RTX 3090 (NVIDIA Corporation, Santa Clara, CA, USA), and the RAM is 128 GB. In this study, experiments were conducted on the JAFFE, KDEF, and CK+ datasets utilizing two distinct splitting modes: (i) in one mode, 80% of the images served as the training set, and 20% were designated as the test set; (ii) in the other mode, the 10-fold CV case, the entire dataset was evenly divided into ten parts. During each CV iteration, one part was used for testing, and the remaining nine parts were combined for training the model. This process was repeated ten times with a different part selected as the test set each time, ensuring that every sample was used for testing exactly once. The training hyperparameters for the network model are detailed in Table 3.
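The two splitting modes can be reproduced as in the scikit-learn sketch below; the placeholder arrays, random seed, and use of stratification are assumptions for illustration, as the original experiments were run in MATLAB.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.random.rand(213, 224, 224)       # placeholder for the preprocessed face images
y = np.random.randint(0, 7, size=213)   # placeholder emotion labels (7 classes)

# (i) 80-20% hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# (ii) 10-fold cross-validation: every sample is used for testing exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_tr, X_te = X[train_idx], X[test_idx]
    # ... train on X_tr / y[train_idx] and evaluate on X_te / y[test_idx] ...
```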

4.3. Experimental Analysis and Comparison

4.3.1. Performance Evaluation on Datasets

The performance of the proposed VGG-SwishNet model was assessed through evaluation on the JAFFE, KDEF, and CK+ datasets. The results show that under the scenarios of an 80–20% split and 10-fold CV, the test accuracy of our approach exceeds 95% on all three datasets, demonstrating excellent performance. An experiment was also designed to generate confusion matrices for the method across the three datasets. The confusion matrix was utilized to present the classification outcomes of a model across various datasets. Illustrating the correspondence between the actual and predicted classes enabled us to intuitively grasp the model’s classification accuracy, types of errors, and their distribution. Figure 6 illustrates the confusion matrix of the proposed model on the JAFFE dataset. From the results, it is evident that the model achieved a perfect classification accuracy of 100% for “Angry”, “Disgusted”, “Fearful”, “Happy”, “Neutral”, and “Surprised” expressions, indicating that all samples of these emotions were correctly classified. The “Sad” expression has a 17% chance of being misclassified as “Disgusted”. This is due to the subtle similarities in facial features or expression intensity between “Disgusted” and “Sad” expressions, leading to a small number of ambiguous facial expression images in the JAFFE dataset. The confusion matrix of the proposed method on the KDEF dataset is shown in Figure 7. Overall, the classification accuracies of the seven emotions are relatively high. The model performs exceptionally well for emotions such as “Happy”, “Neutral”, and “Surprised”. Figure 8 displays the confusion matrix of the proposed approach on the CK+ dataset. It can be clearly seen from the figure that the model achieved the best results on this dataset, with no images being misclassified.
To thoroughly assess the overall performance of our approach, the precision, recall, and F1-score for each emotion type were calculated on the JAFFE, KDEF, and CK+ datasets. Subsequently, the macro-averaging method was employed to mitigate the influence of class imbalance within the datasets on the model’s performance. The macro-averaging method is a common performance evaluation approach in multi-class classification problems, which measures the overall performance of the model by averaging the precision, recall, and F1-score of each type. Table 4 presents the macro-averaged results of the proposed model on the JAFFE, KDEF, and CK+ datasets. On the JAFFE dataset, the model achieved a macro-averaged precision of 97%, a macro-averaged recall of 98%, and a macro-averaged F1-score of 97%, demonstrating the effectiveness of the proposed method for facial expression recognition on small-scale pose datasets. The KDEF dataset includes images captured from five different angles: full left profile, half left profile, front view, half right profile, and full right profile, which are used to illustrate the model’s ability to classify emotions from images captured at various angles. The experimental results obtained using the KDEF dataset indicate that a macro-averaged precision of 95%, a macro-averaged recall of 95%, and a macro-averaged F1-score of 95% were achieved. Therefore, our proposed method can effectively recognize emotions from facial images taken from different angles. Additionally, the proposed method achieved a macro-averaged precision of 100%, a macro-averaged recall of 100%, and a macro-averaged F1-score of 100% on the CK+ dataset, further demonstrating the superior generalization capability and strong classification performance of our approach.
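For reference, the macro-averaged metrics reported in Table 4 can be computed directly from the true and predicted labels, for example with scikit-learn as sketched below; the label vectors shown are placeholders, not the paper’s results.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder ground-truth and predicted emotion labels (0-6 for the seven classes).
y_true = [0, 1, 2, 3, 4, 5, 6, 3, 3, 2]
y_pred = [0, 1, 2, 3, 4, 5, 6, 3, 2, 2]

# Macro-averaging: compute precision, recall, and F1 per class, then take their unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```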

4.3.2. The Impact of the Activation Function on the Model’s Accuracy

This study evaluated the impact of various activation functions by replacing ReLU in the VGG-16 model with six alternatives (Swish, Softplus, ELU, LeakyReLU, ClippedReLU, and GeLU) and testing on the JAFFE, KDEF, and CK+ datasets. The validation losses are shown in Figure 9, Figure 10 and Figure 11, respectively. From these three figures, it is evident that the ClippedReLU function has the highest validation loss on the JAFFE and KDEF datasets, while the Softplus function has the highest validation loss on the CK+ dataset. Additionally, the test accuracy of the seven activation functions on the JAFFE, KDEF, and CK+ datasets was recorded, as depicted in Table 5. From Table 5, it can be observed that the Swish activation function achieved higher recognition accuracy rates on the JAFFE, KDEF, and CK+ datasets compared to ReLU and its variants LeakyReLU and ClippedReLU. This was primarily due to the smoothness of the Swish activation function, which, unlike ReLU, LeakyReLU, and ClippedReLU, has a continuous derivative across all intervals. Furthermore, the Swish activation function achieved a recognition rate of 97.67% on the JAFFE dataset and 100% on the CK+ dataset, surpassing all other activation functions, including the original ReLU. On the KDEF dataset, the best recognition rate of the Swish function was 95.20%, ranking second behind the ELU activation function; however, the recognition accuracy of the ELU function on the JAFFE and CK+ datasets was significantly lower than that of Swish. Therefore, the proposed model with the Swish activation function can reliably recognize various facial emotions in most cases.

4.3.3. Comparison of Performance with Advanced Models

As a cutting-edge representative in the field of object detection, the YOLOv9 model possesses significant value for research and application purposes [49]. In recent years, Transformer-based models have also garnered widespread attention. Among them, the Vision Transformer Base with 16 × 16 Patches (ViT-Base/16) model, which uses patches with a size of 16 × 16, is capable of capturing more detailed facial image information and has demonstrated notable advantages in the FER task. Furthermore, it is a widely recognized and frequently used benchmark model. By comparing our method with different variants of the YOLOv9 model as well as the ViT-Base/16 model, the performance of our approach relative to advanced methods can be clearly and objectively assessed. In the study, all models underwent rigorous testing on the JAFFE, KDEF, and CK+ datasets, with identical hyperparameter configurations, as depicted in Table 3. Two distinct data partitioning strategies were employed: one involved allocating 80% of the dataset for training and 20% for testing, while the other utilized 10-fold cross-validation (CV) to comprehensively evaluate model performance. The experimental outcomes are detailed in Table 6 and Table 7. The observation results indicate that, under both data partitioning strategies, the proposed model achieved the highest accuracy on all three datasets, followed by the ViT-Base/16 model. The YOLOv9t model had the lowest recognition rates across the datasets, followed by YOLOv9s. This is largely because their lightweight structures may compromise certain feature expression capabilities. The YOLOv9m, YOLOv9c, and YOLOv9e models, which have relatively complex architectures, demonstrate certain levels of accuracy across the three datasets but still show a gap compared to our approach. Overall, the proposed method shows good generalization ability and stability in the FER task.

4.3.4. Comparison of Performance with Prior Studies

In this study, the performance of the proposed method was compared with other state-of-the-art methods on the JAFFE, KDEF, and CK+ datasets. Table 8 lists the methods, the total number of samples in the three datasets, the data splitting methods, and the test accuracies. As shown in the table, with the JAFFE dataset consisting of 213 images and the KDEF dataset containing 4900 images, under the 80–20% data split scenario, the Fusion-CNN method proposed by Jabbooree et al. [50], which integrates geometric features extracted using a β-skeleton undirected graph and ellipse parameters with appearance features from a 2D-CNN, achieved accuracies of 93.07% and 90.30% on these two datasets, respectively. In contrast, the CCNN-SVM model proposed by Rashad et al. [51] attained accuracies of 98.4% and 90.43% on the same datasets. Although the proposed method achieved an accuracy of 97.67% on the JAFFE dataset, slightly below the CCNN-SVM result, it demonstrated superior performance on the KDEF dataset, outperforming both of the above methods. On the CK+ dataset, the proposed model achieved a test accuracy of 100% using the 80–20% data split and 99.60% using 10-fold cross-validation (CV), showing strong competitiveness compared to other methods. It is evident that the proposed approach has significant advantages in the FER task compared to prior studies.

4.4. Result Visualization of Student Engagement Detection and Comparative Analysis

4.4.1. Experimental Setup

An approximately 4 min long teaching video was prepared for students to watch in advance. The students’ facial expressions were subsequently recorded during online learning. After the video recording was completed, the proposed model was employed to recognize and classify facial expressions from each frame of the video, with a frame rate of 10 frames per second. The operating environment was identical to that described in Section 4.2. A total of nine second-year graduate students majoring in educational technology participated in this assessment, including three males and six females.

4.4.2. Experimental Results

The proposed approach was tested on the recorded video, and the detection results are depicted in Figure 12. It can be clearly observed from the figure that all faces were recognized and marked with red rectangular boxes. Above the red rectangular box, the dominant emotion of the current face and its corresponding probability are indicated by white text on a red background. Below the box, the current engagement level is also marked with white text on a red background. The red rectangular boxes were added using a Haar feature-based cascade classifier for face detection, which visually highlights the regions of interest in the current facial expression recognition task. The detection results show that all students were “Highly engaged” during the process of watching the video in this experiment.

4.4.3. Semi-Structured Interviews

To further validate the effectiveness of the detection results, in-depth semi-structured interviews were conducted with the nine students involved in the experiment. This method is typically qualitative and is often used as an exploratory tool across various research fields. Based on the students’ responses, their levels of engagement in the online learning session were analyzed. The interview results indicated that eight students reported being “Highly engaged” while watching the video, and one student reported being “Nominally engaged”. In contrast, our proposed method detected that all nine students were “Highly engaged”. This discrepancy may be due to the unnatural behavior exhibited by some students during video viewing, which could have interfered with detection. Additionally, variations in camera resolution and lens quality can also affect the model’s detection. For example, low-quality cameras may produce images with insufficient detail, making it difficult for the model to accurately detect facial expressions. Moreover, different background noise levels, lighting conditions, and changes in network conditions can also impact the robustness and reliability of the model when applied in online learning scenarios. Therefore, despite minor differences between the interview and detection results, the proposed method still effectively detected students’ engagement levels. Overall, the proposed method performs well in assessing students’ engagement in online learning environments.

4.4.4. Comparative Analysis of Advanced Models

Figure 13 presents the engagement distribution of the proposed model, different versions of the YOLOv9 model, and the ViT-Base/16 model among the nine participants in the form of a two-dimensional bar chart. Each model’s performance is represented by bars of three different colors corresponding to the three levels of engagement: “Highly engaged”, “Nominally engaged”, and “Not engaged”. It is evident from the figure that the proposed method detected high engagement in all nine participants, which is largely consistent with the interview results. This indicates that the model can effectively capture facial expressions in online learning environments and accurately calculate the level of engagement. Moreover, our method exhibits high consistency in performance across different participants, demonstrating good generalization capabilities. In contrast, other models such as YOLOv9t and YOLOv9s performed relatively poorly on some participants, primarily due to their limited ability to capture subtle features when handling facial expression recognition tasks. Additionally, the detection results of the YOLOv9e model show that seven participants were classified as “Not Engaged”, which significantly deviates from the actual interview outcomes. This may be due to the model’s high complexity, which led it to learn specific noise and details. As a result, the model’s accuracy is relatively low in practical applications, especially in online learning environments. Therefore, the analysis results of the two-dimensional bar chart support the effectiveness and reliability of the proposed approach in detecting student engagement in online learning, providing a more effective tool for assessing student learning states.

4.4.5. Real-Time Testing and Comparison of Advanced Models

In this study, the real-time performance of different versions of the YOLOv9 model (YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, and YOLOv9e), ViT-Base/16, and our method was evaluated in the online engagement detection task. Each version of the YOLOv9 model was optimized for different application scenarios and requirements in its design. The ViT-Base/16 model offers a good trade-off between accuracy and computational efficiency. Table 9 presents a comparison of the seven models in terms of the total processing time, floating-point operations (FLOPs), and the number of parameters. It can be seen that the YOLOv9t model has the shortest total processing time (175.20 s), followed by the proposed model, which has a total processing time of 178.59 s. In contrast, YOLOv9e has a significantly longer total processing time of 665.64 s. Additionally, the proposed method requires 15.48G FLOPs, second only to YOLOv9t, which enables faster inference and makes it highly advantageous for tasks requiring high real-time performance. In terms of the number of parameters, YOLOv9t and YOLOv9s, as lightweight models, both have fewer than 10M parameters, allowing for rapid deployment in resource-constrained scenarios. Although our approach has a significantly larger number of parameters compared to the other models, it demonstrates a clear advantage in processing speed. This is attributed to the introduction of the BN layers, which further enhances computational efficiency. Therefore, in scenarios with high real-time requirements, such as online learning monitoring and real-time video analysis, the proposed approach, with its excellent processing speed and computational efficiency, can quickly capture students’ facial expressions and behavioral features, showing significant potential for practical applications.

5. Conclusions

Comprehending the extent of student engagement in online learning environments is crucial for improving teachers’ online teaching strategies and enhancing students’ online learning experiences. This paper proposed a novel VGG-SwishNet model, which combines the VGG-16 architecture with the Swish activation function for the task of facial expression recognition. This integration fully leverages the powerful feature extraction capabilities of VGG-16 and the smoothness and self-gating mechanism of the Swish activation function, significantly enhancing the model’s performance. The incorporation of the Swish activation function aids in alleviating the vanishing gradient problem, thereby enabling the model to more effectively capture subtle and asymmetric features within facial expressions. This aspect has received less attention in previous studies but is a crucial issue in facial expression recognition. The proposed model was evaluated on the JAFFE, KDEF, and CK+ datasets, with both 80–20% splitting and 10-fold CV being employed. The results indicate that under both validation methods, the model achieved the best test accuracy of over 95% on all three datasets, demonstrating excellent robustness and competitiveness. Building on the accurate identification of each emotion using the proposed method, student engagement levels were assessed by integrating the corresponding weights of each emotion and categorizing them into three tiers: “Highly engaged”, “Nominally engaged”, and “Not engaged”.
The proposed approach was tested on nine students in an online learning environment, combining the results with semi-structured interviews, and the effectiveness of the VGG-SwishNet model in recognizing emotions and detecting engagement levels online was confirmed. Moreover, the VGG-SwishNet model is not only applicable to facial expression recognition tasks but can also be extended to other multi-class problems that require processing asymmetric features, such as identifying asymmetric tumor characteristics in the field of medical image recognition. In future research, we will also continue to explore more lightweight versions of the model. Specifically, we plan to utilize well-known lightweight network architectures, such as MobileNet and EfficientNet. By integrating these lightweight architectures into the proposed model, we aim to reduce the number of model parameters and computational complexity while further improving its scalability in online learning environments. At the same time, we are also aware of some limitations in this study. For example, the datasets used in the experiment (JAFFE, KDEF, and CK+) have a relatively small number of samples and are strictly controlled, which may not fully reflect the complexity of real-world online learning environments. Therefore, based on the model proposed in this study, we plan to establish a facial expression recognition system in a real online learning environment. During this process, we will collect real-world data to establish a facial expression dataset for online learning, thereby further enhancing the accuracy of student facial expression recognition in online learning environments.
In this study, the primary focus has been on developing a novel VGG-SwishNet model that can automatically and accurately detect students’ facial expressions during online learning. To validate the effectiveness of our proposed method, its performance was first comprehensively evaluated. Subsequently, the model was applied to engagement detection, and its practical effectiveness was further validated through semi-structured interviews. Moreover, comparisons with different variants of the YOLOv9 model and the ViT-Base/16 model have also confirmed that the proposed method has significant advantages in terms of engagement detection accuracy and computational efficiency in the context of online learning. However, it is also deeply recognized that this is merely a preliminary exploration. In future research, we plan to develop a more comprehensive engagement assessment system. Building on the accurate recognition of facial expressions, we will incorporate additional indicators of engagement, such as head movements and eye contact, to provide a more holistic and accurate evaluation of students’ engagement in online learning. This will be an important direction for our future work. Moreover, gaining a deep understanding of how different students’ engagement varies with different learning content is also crucial as it holds significant importance for promoting personalized learning. Therefore, we will also explore the complex relationship between learning content and changes in student engagement and assess the impact of different learning conditions (such as different subjects, teaching methods, or levels of interaction) on student engagement. This will help teachers more effectively improve their teaching strategies and enhance students’ engagement in online learning.

Author Contributions

Conceptualization, Q.Y. and M.W.; methodology, M.W.; software, M.W.; validation, Q.Y., M.W. and Y.L.; formal analysis, M.W.; investigation, M.W.; resources, Q.Y.; data curation, M.W.; writing—original draft preparation, Q.Y.; writing—review and editing, Y.L.; visualization, Q.Y.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Humanities and Social Science Research Project of the Ministry of Education of China (No. 22YJA880076).

Institutional Review Board Statement

This study was conducted in accordance with the rules of the 1975 Declaration of Helsinki (revised 2013), and the protocol was approved by the Ethics Committee of Liaoning Normal University (No. LL2024139) on 9 September 2024.

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study. Informed consent was obtained from all participants to publish the identifiable images in the online publication.

Data Availability Statement

JAFFE is available in Papers with Code at https://paperswithcode.com/dataset/jaffe (accessed on 10 February 2025). KDEF is available in Kaggle at https://www.kaggle.com/datasets/muhammadnafian/kdef-dataset (accessed on 10 February 2025). CK+ is available in Kaggle at https://www.kaggle.com/datasets/zhiguocui/ck-dataset (accessed on 10 February 2025). The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bergdahl, N. Engagement and disengagement in online learning. Comput. Educ. 2022, 188, 19. [Google Scholar] [CrossRef]
  2. Guo, Q.; Graham, C.R.; Borup, J.; Sandberg, B.; West, R.E. Parental support challenges for K-12 student online engagement. Distance Educ. 2024, 45, 579–605. [Google Scholar] [CrossRef]
  3. Luo, N.; Li, H.D.; Zhao, L.; Wu, Z.N.; Zhang, J. Promoting Student Engagement in Online Learning Through Harmonious Classroom Environment. Asia-Pac. Educ. Res. 2022, 31, 541–551. [Google Scholar] [CrossRef]
  4. Heo, H.; Bonk, C.J.; Doo, M.Y. Enhancing learning engagement during COVID-19 pandemic: Self-efficacy in time management, technology use, and online learning environments. J. Comput. Assist. Learn. 2021, 37, 1640–1652. [Google Scholar] [CrossRef]
  5. Han, S.Y.; Zhang, Z.L.; Liu, H.; Kong, W.L.; Xue, Z.C.; Cao, T.H.; Shu, J.B. From engagement to performance: The role of effort regulation in higher education online learning. Interact. Learn. Environ. 2023, 32, 6607–6627. [Google Scholar] [CrossRef]
  6. Ferrer, J.; Ringer, A.; Saville, K.; Parris, M.A.; Kashi, K. Students’ motivation and engagement in higher education: The importance of attitude to online learning. High. Educ. 2022, 83, 317–338. [Google Scholar] [CrossRef]
  7. Shen, J.; Yang, H.; Li, J.; Cheng, Z. Assessing learning engagement based on facial expression recognition in MOOC’s scenario. Multimed. Syst. 2022, 28, 469–478. [Google Scholar] [CrossRef]
  8. Aly, M.; Ghallab, A.; Fathi, I.S. Enhancing Facial Expression Recognition System in Online Learning Context Using Efficient Deep Learning Model. IEEE Access 2023, 11, 121419–121433. [Google Scholar] [CrossRef]
  9. Fang, B.; Li, X.; Han, G.; He, J. Facial Expression Recognition in Educational Research From the Perspective of Machine Learning: A Systematic Review. IEEE Access 2023, 11, 112060–112074. [Google Scholar] [CrossRef]
  10. Maqableh, W.; Alzyoud, F.Y.; Zraqou, J. The use of facial expressions in measuring students’ interaction with distance learning environments during the COVID-19 crisis. Vis. Inform. 2023, 7, 1–17. [Google Scholar] [CrossRef]
  11. Xiong, Y.; Zhou, S.; Wang, J.; Guo, T.; Cai, L. A Personalized Multi-region Perception Network for Learner Facial Expression Recognition in Online Learning. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky, Proceedings of the 25th International Conference, AIED 2024, Recife, Brazil, 8–12 July 2024; Springer: Cham, Switzerland, 2024; pp. 435–443. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Hussain, M. YOLOv1 to v8: Unveiling Each Variant—A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  14. Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
  15. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  16. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  17. Lee, C.-C.; Shih, C.-Y.; Lai, W.-P.; Lin, P.-C. An improved boosting algorithm and its application to facial emotion recognition. J. Ambient Intell. Humaniz. Comput. 2012, 3, 11–17. [Google Scholar] [CrossRef]
  18. Holder, R.P.; Tapamo, J.R. Improved gradient local ternary patterns for facial expression recognition. EURASIP J. Image Video Process. 2017, 2017, 42. [Google Scholar] [CrossRef]
  19. Bellamkonda, S.; Gopalan, N. A facial expression recognition model using support vector machines. IJ Math. Sci. Comput. 2018, 4, 56–65. [Google Scholar] [CrossRef]
  20. Eng, S.; Ali, H.; Cheah, A.; Chong, Y. Facial expression recognition in JAFFE and KDEF Datasets using histogram of oriented gradients and support vector machine. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Kazimierz Dolny, Poland, 21–23 November 2019; p. 012031. [Google Scholar]
  21. Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
  22. Jain, N.; Kumar, S.; Kumar, A.; Shamsolmoali, P.; Zareapoor, M. Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 2018, 115, 101–106. [Google Scholar] [CrossRef]
  23. Sari, M.; Moussaoui, A.; Hadid, A. A Simple Yet Effective Convolutional Neural Network Model to Classify Facial Expressions. In Proceedings of the Modelling and Implementation of Complex Systems, Batna, Algeria, 24–26 October 2021; pp. 188–202. [Google Scholar]
  24. Liu, P.; Han, S.; Meng, Z.; Tong, Y. Facial Expression Recognition via a Boosted Deep Belief Network. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1805–1812. [Google Scholar]
  25. Zhao, X.; Shi, X.; Zhang, S. Facial Expression Recognition via Deep Learning. IETE Tech. Rev. 2015, 32, 347–355. [Google Scholar] [CrossRef]
  26. Roy, A.K.; Kathania, H.K.; Sharma, A.; Dey, A.; Ansari, M.S.A. ResEmoteNet: Bridging Accuracy and Loss Reduction in Facial Emotion Recognition. IEEE Signal Process. Lett. 2025, 32, 491–495. [Google Scholar] [CrossRef]
  27. Bharathi, S.; Hari, K.; Senthilarasi, M. Expression Recognition using YOLO and Shallow CNN Model. In Proceedings of the 2022 Smart Technologies, Communication and Robotics (STCR), Sathyamangalam, India, 10–11 December 2022; pp. 1–5. [Google Scholar]
  28. Vanamoju, S.V.M.D.; Vineetha, M.V.; Tekchandani, H.; Joshi, P.; Shukla, P.K.; Khanna, A. Facial Emotion Recognition using YOLO based Deep Learning Classifier. In Proceedings of the 2024 First International Conference on Electronics, Communication and Signal Processing (ICECSP), New Delhi, India, 8–10 August 2024; pp. 1–5. [Google Scholar]
  29. Parambil, M.M.A.; Ali, L.; Swavaf, M.; Bouktif, S.; Gochoo, M.; Aljassmi, H.; Alnajjar, F. Navigating the YOLO Landscape: A Comparative Study of Object Detection Models for Emotion Recognition. IEEE Access 2024, 12, 109427–109442. [Google Scholar] [CrossRef]
  30. Hasan, M.; Lazem, A. Facial Human Emotion Recognition by Using YOLO Faces Detection Algorithm. Cent. Asian Stud. 2023, 6, 32–38. [Google Scholar] [CrossRef]
  31. Hasnine, M.N.; Bui, H.T.T.; Thu Tran, T.T.; Nguyen, H.T.; Akçapınar, G.; Ueda, H. Students’ emotion extraction and visualization for engagement detection in online learning. Procedia Comput. Sci. 2021, 192, 3423–3431. [Google Scholar] [CrossRef]
  32. Ayouni, S.; Hajjej, F.; Maddeh, M.; Al-Otaibi, S. A new ML-based approach to enhance student engagement in online environment. PLoS ONE 2021, 16, e0258788. [Google Scholar] [CrossRef] [PubMed]
  33. Sharma, P.; Joshi, S.; Gautam, S.; Maharjan, S.; Khanal, S.R.; Reis, M.C.; Barroso, J.; de Jesus Filipe, V.M. Student Engagement Detection Using Emotion Analysis, Eye Tracking and Head Movement with Machine Learning. In Proceedings of the Technology and Innovation in Learning, Teaching and Education, Lisbon, Portugal, 31 August–2 September 2022; pp. 52–68. [Google Scholar]
  34. Su, Y.S.; Suen, H.Y.; Hung, K.E. Predicting behavioral competencies automatically from facial expressions in real-time video-recorded interviews. J. Real-Time Image Process. 2021, 18, 1011–1021. [Google Scholar] [CrossRef]
  35. Teixeira, T.; Granger, E.; Koerich, A.L. Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks. Appl. Sci. 2021, 11, 11738. [Google Scholar] [CrossRef]
  36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  38. Hosna, A.; Merry, E.; Gyalmo, J.; Alom, Z.; Aung, Z.; Azim, M.A. Transfer learning: A friendly introduction. J. Big Data 2022, 9, 102. [Google Scholar] [CrossRef]
  39. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 9. [Google Scholar] [CrossRef]
  40. Seo, Y.; Kim, J.; Park, U. Swish-T: Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance. arXiv 2024, arXiv:2407.01012. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  42. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv 2015, arXiv:1505.00853. [Google Scholar]
  43. Dar, T.; Javed, A.; Bourouis, S.; Hussein, H.S.; Alshazly, H. Efficient-SwishNet Based System for Facial Emotion Recognition. IEEE Access 2022, 10, 71311–71328. [Google Scholar] [CrossRef]
  44. Chen, L.; Fei, H.; Xiao, Y.; He, J.; Li, H. Why batch normalization works? a buckling perspective. In Proceedings of the 2017 IEEE International Conference on Information and Automation (ICIA), Macau, China, 18–20 July 2017; pp. 1184–1189. [Google Scholar]
  45. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 8–14 December 2001. [Google Scholar]
  46. Lyons, M.J.; Akamatsu, S.; Kamachi, M.; Gyoba, J.; Budynek, J. The Japanese female facial expression (JAFFE) database. In Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 14–16. [Google Scholar]
  47. Calvo, M.G.; Lundqvist, D. Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behav. Res. Methods 2008, 40, 109–115. [Google Scholar] [CrossRef] [PubMed]
  48. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  49. Zhang, X. Research on Facial Expression Recognition Based on YOLOv9. In Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 29–31 August 2024; pp. 208–213. [Google Scholar]
  50. Jabbooree, A.I.; Khanli, L.M.; Salehpour, P.; Pourbahrami, S. A novel facial expression recognition algorithm using geometry β-skeleton in fusion based on deep CNN. Image Vis. Comput. 2023, 134, 104677. [Google Scholar] [CrossRef]
  51. Rashad, M.; Alebiary, D.M.; Aldawsari, M.; El-Sawy, A.A.; AbuEl-Atta, A.H. CCNN-SVM: Automated Model for Emotion Recognition Based on Custom Convolutional Neural Networks with SVM. Information 2024, 15, 384. [Google Scholar] [CrossRef]
  52. Dubey, A.K.; Jain, V. Automatic facial recognition using VGG16 based transfer learning model. J. Inf. Optim. Sci. 2020, 41, 1589–1596. [Google Scholar] [CrossRef]
  53. Ahadit, A.B.; Jatoth, R.K. A Novel Dual CNN Architecture with LogicMax for Facial Expression Recognition. J. Inf. Sci. Eng. 2021, 37, 15. [Google Scholar] [CrossRef]
  54. Kartheek, M.N.; Prasad, M.V.N.K.; Bhukya, R. Windmill Graph based Feature Descriptors for Facial Expression Recognition. Optik 2022, 260, 169053. [Google Scholar] [CrossRef]
  55. Appasaheb Borgalli, M.R.; Surve, D.S. Deep learning for facial emotion recognition using custom CNN architecture. J. Phys. Conf. Ser. 2022, 2236, 012004. [Google Scholar] [CrossRef]
  56. Chen, G.; Peng, J.; Zhang, W.; Huang, K.; Cheng, F.; Yuan, H.; Huang, Y. A Region Group Adaptive Attention Model For Subtle Expression Recognition. IEEE Trans. Affect. Comput. 2023, 14, 1613–1626. [Google Scholar] [CrossRef]
  57. Yaddaden, Y. An efficient facial expression recognition system with appearance-based fused descriptors. Intell. Syst. Appl. 2023, 17, 200166. [Google Scholar] [CrossRef]
  58. Zhang, W.; Zhang, X.; Tang, Y. Facial expression recognition based on improved residual network. IET Image Process. 2023, 17, 2005–2014. [Google Scholar] [CrossRef]
  59. Mukhopadhyay, M.; Dey, A.; Kahali, S. A deep-learning-based facial expression recognition method using textural features. Neural Comput. Appl. 2023, 35, 6499–6514. [Google Scholar] [CrossRef]
  60. Jabbooree, A.I.; Khanli, L.M.; Salehpour, P.; Pourbahrami, S. Geometrical facial expression recognition approach based on fusion CNN-SVM. Int. J. Intell. Eng. Syst. 2024, 17, 457–468. [Google Scholar]
Figure 1. The overall architecture proposed for FER and for detecting students’ engagement.
Figure 2. The VGG-16 model architecture after fine-tuning.
Figure 3. Sample facial expression images from JAFFE dataset.
Figure 4. Sample facial expression images from KDEF dataset.
Figure 5. Sample facial expression images from CK+ dataset.
Figure 6. Confusion matrix of proposed model on JAFFE dataset.
Figure 7. Confusion matrix of proposed model on KDEF dataset.
Figure 8. Confusion matrix of proposed model on CK+ dataset.
Figure 9. A validation loss comparison of seven activation functions using the proposed model on the JAFFE dataset.
Figure 10. A validation loss comparison of seven activation functions using the proposed model on the KDEF dataset.
Figure 11. A validation loss comparison of seven activation functions using the proposed model on the CK+ dataset.
Figure 12. Engagement detection of nine students.
Figure 13. Distribution of participant engagement levels detected using advanced models.
Table 1. Emotion weights.

Emotion        | Neutral | Happy | Surprised | Sad | Fearful | Angry | Disgusted
Emotion weight | 0.9     | 0.6   | 0.6       | 0.3 | 0.3     | 0.25  | 0.2
Table 2. Engagement detection from concentration index.

Engagement Type   | Concentration Index (CI)
Highly engaged    | >50%
Nominally engaged | 20–50%
Not engaged       | <20%
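Tables 1 and 2 together define how a per-face engagement label is derived: each recognized emotion contributes a weight (Table 1), and the resulting concentration index (CI) is bucketed into three bands (Table 2). The sketch below illustrates one plausible implementation; the assumption that CI equals the emotion weight multiplied by the classifier's confidence (×100) is ours and may differ from the authors' exact formula.

```python
# Minimal sketch (assumption): CI = emotion weight x classifier confidence x 100,
# then thresholded with the bands of Table 2. Weights follow Table 1.

EMOTION_WEIGHTS = {
    "neutral": 0.9, "happy": 0.6, "surprised": 0.6,
    "sad": 0.3, "fearful": 0.3, "angry": 0.25, "disgusted": 0.2,
}

def concentration_index(emotion: str, confidence: float) -> float:
    """Concentration index (in percent) for one detected face."""
    return EMOTION_WEIGHTS[emotion] * confidence * 100.0

def engagement_level(ci: float) -> str:
    """Map a concentration index to the engagement bands of Table 2."""
    if ci > 50.0:
        return "Highly engaged"
    if ci >= 20.0:
        return "Nominally engaged"
    return "Not engaged"

# Example: a face classified as "happy" with softmax confidence 0.92.
ci = concentration_index("happy", 0.92)   # 0.6 * 0.92 * 100 = 55.2
print(ci, engagement_level(ci))           # 55.2 Highly engaged
```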
Table 3. Training hyperparameters.

Parameter           | Value
Optimizer           | adam
MaxEpochs           | 30
InitialLearnRate    | 0.0001
LearnRateSchedule   | piecewise
LearnRateDropPeriod | 10
LearnRateDropFactor | 0.5
MiniBatchSize       | 16
ValidationFrequency | 30
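The parameter names in Table 3 follow MATLAB's trainingOptions convention (piecewise learning-rate schedule with drop period 10 and drop factor 0.5). The PyTorch sketch below mirrors those settings as an approximation only; it is not the authors' training script, and the placeholder model stands in for the fine-tuned VGG-SwishNet.

```python
# Hypothetical PyTorch equivalent of the Table 3 settings (an approximation,
# not the authors' actual training code).
import torch

model = torch.nn.Linear(10, 7)  # placeholder for the fine-tuned VGG-SwishNet backbone
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # adam, InitialLearnRate = 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=10, gamma=0.5)  # piecewise drop: LearnRateDropPeriod = 10, factor = 0.5

MAX_EPOCHS = 30            # MaxEpochs
BATCH_SIZE = 16            # MiniBatchSize
VALIDATION_FREQUENCY = 30  # validate every 30 iterations, as in trainingOptions

for epoch in range(MAX_EPOCHS):
    # ... iterate over mini-batches of size BATCH_SIZE, compute the loss,
    #     call loss.backward(), then update the weights ...
    optimizer.step()   # weight update (placed after backward() in a real loop)
    scheduler.step()   # halve the learning rate every 10 epochs
```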
Table 4. Precision, recall, and F1-score of proposed model on each emotion type for JAFFE, KDEF, and CK+ datasets (P = precision, R = recall, F1 = F1-score).

Emotion Type  | JAFFE (P / R / F1) | KDEF (P / R / F1)  | CK+ (P / R / F1)
Angry         | 1.00 / 1.00 / 1.00 | 0.95 / 0.98 / 0.96 | 1.00 / 1.00 / 1.00
Disgusted     | 1.00 / 0.86 / 0.92 | 0.94 / 0.95 / 0.95 | 1.00 / 1.00 / 1.00
Fearful       | 1.00 / 1.00 / 1.00 | 0.89 / 0.93 / 0.91 | 1.00 / 1.00 / 1.00
Happy         | 1.00 / 1.00 / 1.00 | 1.00 / 0.97 / 0.99 | 1.00 / 1.00 / 1.00
Neutral       | 1.00 / 1.00 / 1.00 | 1.00 / 0.97 / 0.98 | 1.00 / 1.00 / 1.00
Sad           | 0.83 / 1.00 / 0.91 | 0.91 / 0.93 / 0.92 | 1.00 / 1.00 / 1.00
Surprised     | 1.00 / 1.00 / 1.00 | 0.98 / 0.94 / 0.96 | 1.00 / 1.00 / 1.00
Macro Average | 0.97 / 0.98 / 0.97 | 0.95 / 0.95 / 0.95 | 1.00 / 1.00 / 1.00
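For readers reproducing Table 4, per-emotion precision, recall, and F1 together with the macro average can be computed directly from predicted and true labels; the scikit-learn sketch below uses placeholder label vectors rather than the paper's actual predictions.

```python
# Illustrative computation of per-emotion and macro-averaged metrics as in
# Table 4, using scikit-learn (the label vectors are placeholders).
from sklearn.metrics import classification_report

EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]

y_true = ["happy", "sad", "angry", "neutral", "surprised", "fearful", "disgusted", "sad"]
y_pred = ["happy", "sad", "angry", "neutral", "surprised", "fearful", "disgusted", "happy"]

# Prints per-class precision/recall/F1 plus the macro average row.
print(classification_report(y_true, y_pred, labels=EMOTIONS, zero_division=0))
```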
Table 5. The test accuracy of the proposed model on the JAFFE, KDEF, and CK+ datasets using seven activation functions.

Activation Function | JAFFE  | KDEF   | CK+
ReLU                | 88.37% | 93.27% | 97.98%
Swish               | 97.67% | 95.20% | 100%
Softplus            | 95.35% | 94.69% | 97.20%
ELU                 | 93.02% | 95.41% | 98.79%
LeakyReLU           | 90.70% | 94.29% | 99.19%
ClippedReLU         | 81.40% | 92.24% | 98.38%
GeLU                | 86.10% | 92.45% | 99.60%
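For reference, the Swish activation that leads Table 5 is defined as f(x) = x · sigmoid(βx), with β = 1 in the standard form used as a drop-in ReLU replacement [16]; a minimal NumPy sketch follows.

```python
# Swish activation (Ramachandran et al. [16]): f(x) = x * sigmoid(beta * x).
# With beta = 1 it is smooth, non-monotonic for negative inputs, and
# approaches the identity for large positive inputs.
import numpy as np

def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(swish(x))
```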
Table 6. A comparison of the accuracies achieved with advanced models on the JAFFE, KDEF, and CK+ datasets in the 80–20% case.

Model          | JAFFE  | KDEF   | CK+
YOLOv9t        | 62.50% | 68.16% | 73.68%
YOLOv9s        | 65.62% | 72.67% | 74.09%
YOLOv9m        | 81.25% | 83.27% | 80.57%
YOLOv9c        | 78.12% | 75.31% | 79.35%
YOLOv9e        | 68.75% | 79.39% | 77.33%
ViT-Base/16    | 82.87% | 79.91% | 95.35%
Proposed model | 97.67% | 95.20% | 100%
Table 7. A comparison of the accuracies achieved with advanced models on the JAFFE, KDEF, and CK+ datasets with 10-fold CV.

Model          | JAFFE          | KDEF           | CK+
YOLOv9t        | 56.25% ± 3.14% | 72.65% ± 0.36% | 88.11% ± 0.41%
YOLOv9s        | 59.38% ± 4.15% | 77.65% ± 0.37% | 90.42% ± 0.29%
YOLOv9m        | 68.22% ± 1.34% | 87.86% ± 0.28% | 94.58% ± 0.32%
YOLOv9c        | 71.88% ± 2.25% | 87.45% ± 0.35% | 95.55% ± 0.31%
YOLOv9e        | 68.75% ± 1.76% | 88.47% ± 0.44% | 92.45% ± 0.26%
ViT-Base/16    | 80.65% ± 3.26% | 92.46% ± 0.24% | 97.67% ± 0.28%
Proposed model | 95.84% ± 1.45% | 95.35% ± 0.34% | 99.60% ± 0.21%
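The mean ± standard deviation entries in Table 7 are the usual summary of 10-fold cross-validation scores. The sketch below shows that bookkeeping with a toy scikit-learn classifier; in the paper each fold involves retraining a deep model, but the aggregation of fold accuracies is the same.

```python
# How "mean ± std" entries like those in Table 7 are produced from stratified
# 10-fold cross-validation (toy classifier and dataset, for illustration only).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_digits(return_X_y=True)  # stand-in dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"{scores.mean() * 100:.2f}% ± {scores.std() * 100:.2f}%")
```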
Table 8. A comparison between the proposed method and existing methods on the JAFFE, KDEF, and CK+ datasets.

Method | Total of Samples (JAFFE / KDEF / CK+) | Data Splitting Method | Test Accuracy (JAFFE / KDEF / CK+)
VGG-16 + TL [52] | 213 / — / 4000 | 50–50% | 93.7% / — / 94.8%
VGG-16 + LogicMax [53] | 213 / — / 1479 | 10-fold CV | 94.86% / — / 98.62%
Improved CNN [23] | 213 / 490 / 981 | 80–20% | 86.24% / 82.38% / 88.23%
Efficient-SwishNet [43] | 213 / 4900 / 1200 | 80–20% | 95.02% / 85.5% / 100%
Efficient-SwishNet [43] | — / 980 / — | 80–20% | — / 88.3% / —
WGFD + SVM [54] | 213 / 490 / — | LOSO | 66.20% / 83.47% / —
WGFD + SVM [54] | — / — / 1296 | 10-fold CV | — / — / 90.59%
Custom CNN [55] | 213 / — / 654 | 10-fold CV | 91.58% / — / 92.27%
Region Group Adaptive Attention Model [56] | 213 / 4900 / 981 | 80–20% | 95.20% / 93.47% / 99.59%
Appearance Fusion + SVM [57] | 207 / 980 / — | 10-fold CV | 97.16% / 90.12% / —
Improved Residual Network [58] | — / 980 / 981 | 80–20% | — / 93.38% / 96.37%
LBP-CNN [59] | 213 / — / 331 | 80–20% | 75.0% / — / 79.5%
LTP-CNN [59] | 213 / — / 331 | 80–20% | 77.3% / — / 89.2%
CLBP-CNN [59] | 213 / — / 331 | 80–20% | 81.2% / — / 91%
Fusion CNN [50] | 213 / 4900 / 981 | 80–20% | 93.07% / 90.30% / 98.22%
Fusion CNN-SVM [60] | 213 / — / 981 | 80–20% | 89.23% / — / 96.19%
CCNN-SVM [51] | 213 / 4900 / 974 | 80–20% | 98.4% / 90.43% / 99.3%
CCNN-SVM [51] | 213 / 4900 / 974 | 5-fold CV | 96.62% / 87.18% / 99.3%
Our Proposed Method | 213 / 4900 / 1236 | 80–20% | 97.67% / 95.20% / 100%
Our Proposed Method | 213 / 4900 / 1236 | 10-fold CV | 95.84% / 95.35% / 99.60%
Table 9. Performance metrics of advanced models for engagement detection.

Model          | Total Processing Time (s) | FLOPs (G) | Parameters (M)
YOLOv9t        | 175.20                    | 7.7       | 2.1
YOLOv9s        | 267.63                    | 26.4      | 7.3
YOLOv9m        | 274.84                    | 76.3      | 20.1
YOLOv9c        | 296.24                    | 102.1     | 25.5
YOLOv9e        | 665.64                    | 189.0     | 58.1
ViT-Base/16    | 219.73                    | 17.2      | 86.4
Proposed model | 178.59                    | 15.48     | 138.3
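As a rough sanity check on the "Parameters (M)" column, a stock VGG-16 backbone carries about 138.4M parameters, in the same range as the 138.3M reported for the proposed model; the torchvision snippet below counts them. FLOPs, by contrast, are usually estimated with a profiler such as fvcore or thop, which is not shown here.

```python
# Count the parameters of a stock VGG-16 backbone; the result (~138.4M) is in
# the same range as the proposed model's entry in Table 9.
import torch
from torchvision.models import vgg16

model = vgg16(weights=None)  # architecture only, no pretrained weight download
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```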
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

