Article

Addressing Class Imbalances in Video Time-Series Data for Estimation of Learner Engagement: “Over Sampling with Skipped Moving Average”

Xianwen Zheng, Shinobu Hasegawa, Wen Gu and Koichi Ota
1 Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Ishikawa 923-1292, Japan
2 Center for Innovative Distance Education and Research, Japan Advanced Institute of Science and Technology, Ishikawa 923-1292, Japan
* Authors to whom correspondence should be addressed.
Educ. Sci. 2024, 14(6), 556; https://doi.org/10.3390/educsci14060556
Submission received: 27 April 2024 / Revised: 17 May 2024 / Accepted: 20 May 2024 / Published: 22 May 2024

Abstract

Disengagement of students during online learning significantly impacts the effectiveness of online education. Thus, accurately estimating when students are not engaged is a critical aspect of online-learning research. However, the inherent characteristics of public datasets often lead to issues of class imbalances and data insufficiency. Moreover, the instability of video time-series data further complicates data processing in related research. Our research aims to tackle class imbalances and instability of video time-series data in estimating learner engagement, particularly in scenarios with limited data. In the present paper, we introduce “Skipped Moving Average”, an innovative oversampling technique designed to augment video time-series data representing disengaged students. Furthermore, we employ long short-term memory (LSTM) and long short-term memory fully convolutional network (LSTM-FCN) models to evaluate the effectiveness of our method and compare it to the synthetic minority over-sampling technique (SMOTE). This approach ensures a thorough evaluation of our method’s effectiveness in addressing video time-series data imbalances and in enhancing the accuracy of engagement estimation. The results demonstrate that our proposed method outperforms others in terms of both performance and stability across sequence deep learning models.

1. Introduction

In today’s digital age, online education has become integral to learning. In this context, engagement, defined as a state of mind that helps learners feel positive and realize quality learning [1], is crucial to ensuring that they remain motivated and connected throughout their educational journey. Kage [1] recognizes that engagement directly affects the effectiveness of online learning and self-paced courses. Dewan et al. [2] and Hollister et al. [3] report that learners can quickly lose engagement during online learning, making a high level of engagement difficult to maintain. Empirical studies [4,5,6,7] highlight that learners often struggle to maintain a consistent level of engagement, in part due to limited interaction opportunities and a lack of diverse, compelling engagement strategies.
In addition, there are few meaningful strategies to help students maintain high levels of engagement. Studies focusing on learners’ perceptions of engagement strategies in online learning have identified three categories: learner-to-learner, learner-to-instructor, and learner-to-content [7], of which, students appear to value learner-to-instructor engagement the most because it better promotes student satisfaction and motivation. Therefore, it is essential that instructors understand learner engagement in online learning.
In particular, estimating/detecting low learner engagement during online learning is a critical challenge in providing appropriate support. To address this challenge, many researchers have proposed machine learning approaches for estimating learner engagement [8]. Such research typically relies on public datasets for engagement estimation/detection, such as “in the wild” [9] or DAiSEE [10], because constructing datasets annotated with engagement is costly and fair performance comparisons are difficult with closed datasets. These datasets were collected in a “wild” setting, which makes engagement estimation challenging for low-illumination face images, facial occlusion, etc. Additionally, as is known from the Hawthorne effect, participants might change their behavior and maintain good engagement because they are aware of being recorded as part of an experiment [11]. Consequently, such datasets suffer from class imbalance, where low-engagement data are relatively scarce. Because of the high complexity of the low-engagement data, it is difficult to model the minority classes adequately during the machine/deep learning process. As a result, minority data are not effectively classified due to the influence of majority data [12]. This affects the accuracy of engagement estimation in this research area [13].
This article introduces an original preprocessing approach called “Skipped Moving Average”, which not only preserves the integrity of the original video data but also captures its temporal dynamics and variation to address this issue. This method aims to mitigate the problems caused by video data imbalances in time-series analysis.

2. Related Work

2.1. Definition of Engagement

Engagement is commonly defined in educational terms as encompassing three aspects: cognitive, behavioral, and emotional. Cognitive engagement refers to the thoughtfulness and willingness to expend the effort necessary to understand complex ideas and master difficult skills [2,14]. Behavioral engagement is based on the concept of participation, which includes engaging in classroom and extracurricular activities, staying focused, completing assigned work, and following an instructor’s directions [15]. Emotional engagement includes positive and negative reactions to instructors, classmates, and learning content [7], which are thought to foster connections with the instructor and influence learners’ willingness to engage in learning activities.
Owing to the significance of learner-to-instructor strategies, this study focuses on estimating emotional engagement during online learning. We define emotional engagement as the emotional feedback learners exhibit towards learning content and instructors in online education. This encompasses whether students are actively and positively focused on the learning process.

2.2. Approaches in Emotional Engagement Estimation

In recent studies, several prevalent methods have been used to acquire data for analyzing student engagement. These include those based on learning log files, external devices, and computer vision. Learning log files [16] are particularly suitable for cognitive and behavioral engagement analysis because of their relatively long periods of relevance. The analysis methods for sensor data from external devices, such as EEG, blood pressure, heart rate, or galvanic skin response, can achieve high accuracy [17,18,19]. However, they have limited generalizability and are unsuitable for real educational settings. In addition, using keyboard-and-mouse activity [20] as a measure of online learning engagement does not apply to learners who take online classes with iPads and mobile phones. A key feature of online learning is its flexibility in terms of device and location, so such methods ignore several of the advantages of online learning. This highlights the advantages of computer-vision-based methods in engagement research.
Psychological research has shown that facial expressions and body posture are important channels for conveying emotions and thoughts [21,22]. Consequently, studying external expressions constitutes an important approach to estimating/detecting learners’ emotional engagement. In this study, we adopt a computer-vision-based approach to extract external features that enable the analysis of learners’ engagement.

2.3. Computer-Vision-Based Features

In computer-vision-based studies of engagement in online learning, the most commonly used features are facial expression, gesture, posture, and eye movement. Action Units (AUs) [23], Local Binary Patterns (LBPs) [24], and Histogram of Oriented Gradients (HOGs) [25] are found to be popular for facial-expression recognition and engagement estimation/detection [2]. These methods have achieved significant success in related research. However, the use of facial-expression features alone in engagement estimation/detection research presents certain limitations, so some studies have incorporated gestural and postural features to improve the accuracy of engagement detection.
Chang et al. [26] used OpenPose to track information about head, body, and hand movements and to capture changes in subjects’ body postures. The frequency of the appearance of the hands and the distance between the nose and the neck were taken to represent hand and body movements, respectively. Fewer restless movements indicate a higher level of engagement intensity, while more restless movements indicate a lower level of engagement intensity. Regarding eye information, some studies have used eye trackers and existing libraries such as OpenFace to obtain eye information data. However, eye trackers are not considered in our study because they are external devices. The eye information provided by libraries like OpenFace is fixed, and some studies [27] have also shown that information such as brow raising, brow lowering, and eyelid tightening are strongly correlated with learner engagement.
Observation of raw videos in the DAiSEE dataset has shown that the body appears tightly closed when the learner is at a low level of engagement. As the engagement level increases, the limbs and torso become more stretched. The head pose at low engagement levels is often tilted, whereas at high engagement levels, the head is held upright and adopts a serious expression. Therefore, the frequency of hand and body movements alone does not provide a complete representation of engagement. In addition, in online learning, learners exhibit a range of actions, such as leaning toward or away from the screen; turning their bodies, shoulders, and faces; and supporting their face or hair with their hands. In this case, we need better computer-vision-based features to reflect students’ engagement in online learning.

2.4. Dataset

Due to the specificity of learner engagement, there are relatively high demands for labels, video duration, data volume, and learning content details in datasets pertinent to this area of research. However, many related studies that have achieved certain research results have used non-public datasets. Collecting a large dataset that meets the aforementioned criteria within a short period is a challenging and often costly endeavor. Therefore, existing public datasets play a foundational role in propelling research in this area, and maximizing the utility of such datasets also poses a significant challenge.
Specific literature reviews [2] list several publicly available and annotated datasets. The most commonly used public datasets for computer-vision-based engagement estimation/detection are “in the wild” and DAiSEE. These datasets have been used in several related works, as mentioned above. Kaur et al. [9] created a new “in the wild” dataset (published in 2018) with video recordings of participants watching stimulus videos. This collection includes videos labeled with engagement levels by a team of five annotators. The videos were captured using the Skype application at the other end of a Skype video call with a Microsoft Lifecam wide-angle F2.0 camera (excluding recordings made directly through Skype) to simulate various environments such as frame drops, network latency, and interference. There are 91 subjects (27 females and 64 males) in the dataset, with a total of 264 videos, each approximately five minutes in length, collected in various locations such as computer labs, dorm rooms, and open spaces. The videos in the dataset were labeled based on their engagement intensity, ranging from 0 to 3, with (0) not engaged, (1) less engaged, (2) engaged, and (3) highly engaged.
Gupta et al. [10] presented the DAiSEE dataset, which contains video recordings of learners in online learning courses annotated with crowdsourced engagement labels. They used a high-definition webcam mounted on a computer to capture the states of students as they viewed online courses for data collection. The dataset includes 112 subjects of Asian ethnicity, 32 females and 80 males, between the ages of 18 and 30. This dataset includes 9068 video snippets recorded in six different locations and under three different lighting conditions, which were chosen to reflect the variety of environments that students might be in while engaged in online learning. Each video in the dataset is 10 s long and is assigned a unique identification number with a label indicating the level of engagement, frustration, confusion, and boredom. However, in this research, we used only the engagement label as their inner state. The entire dataset categorizes these engagement levels into four levels: (1) very low, (2) low, (3) high, and (4) very high.
Table 1 provides a comparison of the basic information for the “in the wild” [9] and DAiSEE [10] datasets. The advantage of the DAiSEE dataset is its larger number of subjects, greater data volume, and longer total duration compared with the other dataset. Moreover, a crucial difference is that the “in the wild” dataset shows engagement with five-minute videos, while the DAiSEE dataset shows engagement with 10-s videos. In terms of estimating engagement in online learning, which can change moment-to-moment, it is important to derive meaningful insight from short videos.
Figure 1 shows the distribution of the four levels of engagement labels in the “in the wild” and DAiSEE datasets. Both datasets share a large difference in the number of data per class by engagement level. In particular, the amount of low engagement, such as “not engaged” and “very low”, is significantly small. The following section explains how this affects engagement estimation in the previous research.

2.5. Architectures in Emotional Engagement Estimation

In this section, we discuss some previous studies that used class-imbalanced datasets to estimate/detect engagement through computer-vision-based methods. In their 2018 study on the “in the wild” dataset, Chang et al. [26] proposed an ensemble framework that integrates three cluster-based conventional models and an attention-based NN model enhanced with heuristic rules to predict learners’ engagement levels while watching Massive Open Online Course (MOOC) videos. The classwise mean square error (MSE) results of their study from engagement levels 0–3 were 0.263, 0.079, 0.032, and 0.136, respectively. The best performance in this model was Level 2 with an MSE of 0.032, and the worst performance was Level 0 with an MSE of 0.263, mainly due to the imbalanced dataset favoring the majority classes.
In their 2022 study using the DAiSEE dataset, Villaroya et al. [30] focused on the creation of an automated engagement detection system using facial features such as head position, gaze direction, facial expression, and distance from the user to the recording RGB camera. The system development used the Random Forest algorithm as the primary machine learning technique. The F1 scores of the classifier evaluation from very low to very high engagement level labels were 0.671, 0.742, 0.890, and 0.860, respectively. However, the task performed here was different from our research because it did not estimate engagement from 10-s videos, but rather from shorter decomposed videos. The observed differentiation in the result could be attributed to the unbalanced input dataset DAiSEE, which was used in their study.
Dresvyanskiy et al. [13] employed a range of augmentation and class-balancing strategies coupled with a fusion of emotion-based and attention-based deep embeddings. They then modeled these fused features over time to develop a dependable engagement recognition system using facial imagery. Furthermore, they introduced a novel baseline metric, advocating for a baseline performance assessment grounded in the unweighted average recall (UAR) metric. The overall performance of the model in their study achieved an accuracy of 39.02% and a UAR of 44.27%.
In the above investigations and review papers [2], we found that class imbalances and insufficient data are common issues in many studies. To explore this research problem in our preliminary experiments [31,32], we adopted long short-term memory (LSTM) and quasi-recurrent neural network (QRNN) sequence models to estimate engagement using time-series facial and body key point information. We evaluated these deep learning methods on the DAiSEE dataset and combined very low and low engagement into a single label. For the resulting three engagement levels, LSTM and QRNN achieved recall values of 0.050, 0.740, and 0.410 and 0.000, 0.930, and 0.070, respectively. The corresponding F1 scores were 0.090, 0.640, and 0.470 for LSTM and 0.000, 0.660, and 0.120 for QRNN.
We also reproduced the following study and converted its regression outputs into classification results. Ai et al. [33] introduced a comprehensive end-to-end framework, Class Attention in Video Transformer, for predicting engagement intensity. The architecture is based on self-attention between patches and class attention between class tokens and patches. To combat the challenge of insufficient training samples, they developed a binary order representative sampling method, significantly enhancing the model’s ability to predict engagement intensity. They achieved state-of-the-art MSEs of 0.049 and 0.037 for the “in the wild” and DAiSEE datasets, respectively. After converting the regression results into classifications and combining the very low and low engagement labels as in our preliminary experiments, the recall values were 0.571, 0.789, and 0.667, with F1 scores of 0.696, 0.682, and 0.690, for the “in the wild” dataset. For the DAiSEE dataset, the recall values were 0.068, 0.732, and 0.421, with F1 scores of 0.122, 0.625, and 0.489, respectively.
Based on the characteristics of the dataset, the results were relatively good even with low engagement for “in the wild”, yet not so good for DAiSEE. This outcome aligns with the findings of the related studies mentioned previously [2,26,30,33] and corroborates the observations from our preliminary experiments [31,32] as shown in Table 2, underscoring the persisting challenge of class imbalanced data in recent research. This persistent issue highlights the need for further investigation and solution development in this area.

2.6. Issues Addressed

The investigations in Section 2.5 used different datasets, i.e., DAiSEE and “in the wild” datasets, but they consistently encountered the following problems:
  • RQ1: How do we deal with class-imbalanced datasets such as DAiSEE?
Class imbalance leads to inconsistency in the experimental results of engagement classification, making it difficult to reliably estimate/detect the different engagement levels of learners in real-time online learning. Moreover, low engagement must be estimated in real contexts, but using these datasets leads to inaccuracy owing to the paucity of low-engagement data. Thus, obtaining data on learner disengagement is particularly challenging, which critically impacts the advancement of future research in this area [34].
  • RQ2: How does the proposed method affect the accuracy of engagement estimation?
Other research challenges are insufficient data and the oversimplification of training samples. This often leads to suboptimal training results and causes the model to overfit the training and validation sets, hindering its generalizability and performance [26,35].
This study proposes an over-sampled data preprocessing method to solve the problem of class imbalanced video time-series data in research on learners’ online learning engagement. Considering the real-time variability of learners’ engagement, we will use the DAiSEE dataset to examine our proposed method.

3. Proposed Methods

3.1. Sampling Method

The class imbalance problem is a challenge that needs to be solved in many research areas. Resampling techniques are common approaches to balancing datasets. Oversampling techniques increase the minority class by replicating existing instances or generating new ones, thereby achieving a balanced dataset [36]. One such method that has been widely used is the synthetic minority over-sampling technique (SMOTE) [37]. However, some disadvantages of SMOTE, such as oversampling of uninformative samples, noise interference, and blindness of neighbor selection, also remain to be addressed [38]. Therefore, we propose a novel oversampling method, “Skipped Moving Average”, tailored for video time-series data to address the problem of data class imbalances.

3.1.1. Skipped Moving Average and Video Frame Downsampling

Skipped Moving Average is a data preprocessing technique developed for video time-series data, aimed at addressing the issue of data imbalances. This method reduces redundancy and smooths video data by applying a moving average to the original video frames. In the DAiSEE dataset, each video sample is ten seconds, with a quality of 1920 × 1080 at 30 frames per second (fps) [10]. This means that each sample contains a total of 300 frames. Engagement is a sustained affective state rather than fleeting expressions [1], which may not require capturing data at such high frame rates. Additionally, the latency of deep learning models in processing this data also needs to be considered [39]. Thus, having 30 fps and 300 frames per ten seconds may be redundant in terms of both frequency and time span for real-time estimation of student engagement in online courses. Therefore, the first step was to extract the number of frames in the sample videos using the Skipped Moving Average method.
First, because each video is 10 s long, the sequence timesteps are set to 10. To ensure that the number of frames after sampling is an integer multiple of the timesteps, the possible moving average windows are 2, 3, 5, 6, and 10 frames, which correspond to oversampling the low-engagement data at 15-, 10-, 6-, 5-, and 3-fold sampling rates, respectively. Among them, averaging 2 frames with a 15-fold sampling rate is too high, yielding 7800 samples after sampling, while averaging 10 frames with a 3-fold sampling rate is too low, yielding 1560 samples; both settings would lead to data imbalance again. Therefore, averaging 3, 5, or 6 frames is feasible, corresponding to 10-, 6-, and 5-fold sampling rates, respectively. In our preliminary experiments, as shown in Table 3, we found that averaging 3 frames with a 10-fold sampling rate caused slight overfitting for low-engagement outcomes, leading to unstable recall and F1 scores during testing. To achieve as balanced a sampling rate as possible, we abandoned the 6-frame average in favor of averaging 5 frames with a 6-fold sampling rate, which we used as the moving window value for our study. Note that this study identified the best parameters for the LSTM condition on the DAiSEE dataset; this does not imply that the 5-frame average window is universally applicable. Our approach aims to identify the best settings under the current conditions and can thus serve as a reference for other researchers.
We set 5 frames as the moving average window. By averaging every 5 frames from the 300 frames, we obtain 60 average values per sample video from the DAiSEE dataset. Thus, this method segments a sample video with 300 frames into 60 sequences. Figure 2 illustrates the Skipped Moving Average method applied to resampling video frames from the DAiSEE dataset, reducing 300 frames to 60 sequences by averaging every 5 frames.
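Below is a minimal NumPy sketch of the frame-averaging step described above. The function name, the (frames × features) array layout, and the dummy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def skipped_moving_average(frames: np.ndarray, window: int = 5) -> np.ndarray:
    """Average every `window` consecutive frames of a (n_frames, n_features) array.

    For a 300-frame DAiSEE sample with window=5, this yields 60 averaged sequences."""
    n_frames, n_features = frames.shape
    if n_frames % window != 0:
        raise ValueError("frame count must be an integer multiple of the window")
    # Group frames into non-overlapping windows and average within each window.
    return frames.reshape(n_frames // window, window, n_features).mean(axis=1)

# Example: 300 frames of 32-dimensional features -> (60, 32).
sample = np.random.rand(300, 32)
print(skipped_moving_average(sample, window=5).shape)
```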

3.1.2. Average Oversampling Input Videos

The DAiSEE dataset has four engagement-level labels: (1) very low, (2) low, (3) high, and (4) very high. In Table 4, the “Original Labels” row shows the number of videos for the four original engagement labels provided by the dataset. We observed that the proportion of data for the very low and low labels was excessively small, reaching a highly imbalanced ratio. Given that our research primarily aims to identify when learners disengage, the four-level classification appeared overly detailed. Some studies have shown that videos labeled very low and low are very similar [40]. Therefore, we combined the data for the very low and low labels into a single low-engagement label, shown as “Relabel” in Table 4. There are several other examples of integrating the very low and low labels in this way [31,32,41].
From the previous step, we obtained 60 sequences from a sample video. Since the sample videos are 10 s, we set the timesteps to 10, resulting in 6 sequences per second. We then sampled 1 sequence from each second to represent the data for that second. After completing the above process, each sample video is divided into 6 segments, each segment consisting of 10 timesteps. To resample the video data, all 6 segments from videos labeled “low” were retained in their entirety. In contrast, for videos labeled “high” and “very high”, only the first segment (composed of the first sequence from each second) was preserved to form the sample. Figure 3 illustrates the process of oversampling the DAiSEE dataset by segmenting videos into 6 segments. During the data processing phase, some video samples and labels were lost; as a result, we obtained the data presented in the “Oversample” row of Table 4. Table 4 summarizes the number of original labels, the data merged from the “very low” and “low” labels, and the sample numbers after oversampling.
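The sketch below illustrates, under our reading of the procedure above, how the 60 averaged sequences of one video could be split into 6 segments of 10 timesteps and then oversampled by label. The function names and string label values are hypothetical, and the code assumes the 60 sequences are ordered chronologically with 6 per second.

```python
import numpy as np

def split_into_segments(sequences: np.ndarray, seconds: int = 10, per_second: int = 6):
    """Split a (60, n_features) array into 6 segments of shape (10, n_features).

    Segment k is built from the k-th averaged sequence of every second, so each
    segment covers the full 10 seconds with one sequence per second."""
    grid = sequences.reshape(seconds, per_second, -1)   # (10, 6, n_features)
    return [grid[:, k, :] for k in range(per_second)]   # 6 arrays of (10, n_features)

def oversample_video(sequences: np.ndarray, label: str):
    """Keep all 6 segments for the minority 'low' class (6-fold oversampling);
    keep only the first segment for 'high' and 'very high'."""
    segments = split_into_segments(sequences)
    return segments if label == "low" else segments[:1]

# Example: one downsampled video (60 sequences x 32 features).
video = np.random.rand(60, 32)
print(len(oversample_video(video, "low")), oversample_video(video, "high")[0].shape)
```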

3.2. Feature Extraction Method

In related research, the closely linked terms ’emotional’ and ’affective’ are commonly used to define emotional/affective engagement [2]. Affective engagement is an emotional response towards learning, such as interest and enjoyment in a subject [42], while emotional engagement involves both positive and negative feelings towards educators, peers, and academic material [7]. Bond et al. [43] analyzed 243 studies and found the five most frequently identified indicators of affective engagement to be, in order, positive interaction with teachers and peers; enjoyment; positive attitude towards learning; interest; and motivation. Conversely, the top five indicators of disengagement were identified as frustration; disappointment; worry and anxiety; boredom; and disinterest. The theory that people’s psychological states are expressed through facial expressions, body language, and the tone and intensity of their voices is widely recognized in the field of psychology [23,44]. Moreover, research in behavioral science has revealed that body expressions play a more crucial role in nonverbal communication than previously recognized [22,45]. Especially in online learning environments, many students display limited facial and body expressions. At the same time, they may rest their face with their hands or cover parts of their face. In such instances, it becomes challenging to accurately capture facial expressions, making it difficult to analyze learners’ engagement levels correctly during online learning. That is to say, body expressions become significantly more important. Thus, incorporating an analysis of body expressions along with facial expressions provides a more comprehensive understanding of students’ emotional and affective states, which allows us to better deduce the level of their engagement in online learning. This integrated approach enhances our ability to accurately assess their level of engagement in online learning contexts.
In our study, we used OpenPose [46,47] to extract learners’ facial and body key points, and with these key points, we developed computer-vision-based facial and body expression features to analyze learners’ engagement levels during online learning. Figure 4 shows key points extracted from the raw footage of participants, as provided in the reference paper [10]. It is evident from the images that the OpenPose method is sufficient to obtain the features in the DAiSEE dataset. Psychologists Ekman et al. developed the Facial Action Coding System (FACS) to understand emotions such as happiness, sadness, anger, surprise, fear, and disgust, each of which is associated with specific facial expressions [23]. However, it is important to note that while there is a significant correlation, the relationship between facial expressions and emotional/affective states can be influenced by context, individual differences, and cultural factors [23,48]. In addition, recent research has not precisely established which facial expressions are associated with which levels of engagement [2]. In nonverbal communication research, facial and body expressions are crucial channels for conveying emotions, intentions, and attitudes [22]. Similarly, Kleinsmith et al. reviewed studies on mapping body features to affective states [22]. They found that the oscillation and movement of body parts such as the arms, head, shoulders, elbows, and hands are strongly correlated with the expression of internal emotions. This research highlights the significant role that the dynamics of body language play in conveying psychological states, underscoring the intricate relationship between physical movements and emotional expressions.
Therefore, based on the related studies, we designed computer-vision-based face and body features that include eye information, eyebrow and lip shapes, facial rotation angles, head and body posture, distance between the face and the screen, as well as body movement. These designed features were also utilized in our previous study [31]. We found that the designed features from facial and body key points not only add flexibility to the input features but also lay the groundwork for future research. This approach opens up possibilities for analyzing how different expressions correlate with various levels of engagement and for tailoring the analysis to individual differences among learners.
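The paper does not enumerate the full 32-dimensional feature set, so the sketch below only illustrates how a few plausible features of this kind (nose–neck distance as a posture proxy, inter-ocular distance as a face-to-screen distance proxy, and head tilt) might be computed from OpenPose (x, y) key points. The chosen features, function names, and example coordinates are assumptions for illustration, not the authors' exact design.

```python
import numpy as np

def distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Euclidean distance between two (x, y) key points."""
    return float(np.linalg.norm(p1 - p2))

def head_tilt_deg(left_eye: np.ndarray, right_eye: np.ndarray) -> float:
    """Facial rotation proxy: angle of the inter-ocular line to the horizontal."""
    dx, dy = right_eye - left_eye
    return float(np.degrees(np.arctan2(dy, dx)))

def frame_features(nose: np.ndarray, neck: np.ndarray,
                   left_eye: np.ndarray, right_eye: np.ndarray) -> np.ndarray:
    """Three illustrative per-frame features derived from key points."""
    return np.array([
        distance(nose, neck),           # head/torso posture proxy
        distance(left_eye, right_eye),  # apparent face size ~ face-to-screen distance
        head_tilt_deg(left_eye, right_eye),
    ])

# Example with made-up key point coordinates (pixels).
print(frame_features(np.array([320.0, 200.0]), np.array([320.0, 330.0]),
                     np.array([290.0, 185.0]), np.array([350.0, 183.0])))
```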

3.3. Training Method

One characteristic of engagement is that it is ongoing and changing [1]. Engagement is not a static state but fluctuates constantly, making the capture and prediction of its changes an essential topic of study. Therefore, to evaluate our proposed data preprocessing and oversampling methods, we conducted experiments using a time-series deep-learning model, LSTM [49], a type of recurrent neural network suitable for sequential data. In addition, to assess the stability and generalizability of our proposed method, we also conducted experiments on a variant of the LSTM model, the long short-term memory fully convolutional network (LSTM-FCN) [50]. The LSTM-FCN combines the sequential learning capabilities of the LSTM with the pattern recognition capabilities of FCNs, making it highly effective for time-series classification tasks by capturing both temporal and spatial dependencies in the data.

4. Results

The original video data in the DAiSEE dataset [10] were divided into training, validation, and test sets with proportions of 60%, 20%, and 20%, respectively. We retained these same proportions. We applied PyTorch to build our LSTM and LSTM-FCN models. PyTorch is an open-source machine learning library based on the Torch library, widely used for applications such as computer vision and natural language processing, and recognized for its flexibility and ease of use in building deep-learning models [51].
First, we (1) used StandardScaler to normalize the merged training and validation data, which are structured as 32-dimensional features over 10 timesteps. Next, we (2) divided the training and validation data at an 80:20 ratio using random state = 10. Then, (3) the processed data were used to train an LSTM model with one LSTM layer, 32 hidden units, and a fully connected layer. In parallel, after steps (1) and (2), we also (4) trained an LSTM-FCN model whose forward pass processes the input through an LSTM layer and three stacked temporal convolutional blocks with filter sizes of 128, 256, and 128, respectively. Each block consists of a temporal convolutional layer followed by batch normalization and a ReLU activation function. The LSTM and convolutional outputs are concatenated and passed through a fully connected layer with a softmax activation function to produce the final output, as shown in Figure 5 [52]. Both the LSTM and LSTM-FCN models were trained for 50 epochs. Finally, we tested the models on the original test set, whose data processing also employed StandardScaler normalization and the Skipped Moving Average method without oversampling; from the 60 sequences obtained after downsampling, one sequence per second was selected to form a 10-timestep input structure.
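As a reference, here is a PyTorch sketch of an LSTM-FCN consistent with the description above (one 32-unit LSTM layer, three temporal convolutional blocks with 128, 256, and 128 filters, concatenation, and a softmax output). Kernel sizes, global average pooling, and the dummy input are assumptions not stated in the paper; in practice, one would typically train on raw logits with CrossEntropyLoss rather than on the softmax outputs.

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int, k: int) -> nn.Sequential:
    """Temporal convolution + batch normalization + ReLU, as described in the text."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=k, padding=k // 2),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
    )

class LSTMFCN(nn.Module):
    """Sketch of the two-branch LSTM-FCN; kernel sizes (7, 5, 3) are assumed."""

    def __init__(self, n_features: int = 32, n_classes: int = 3, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=1, batch_first=True)
        self.fcn = nn.Sequential(
            conv_block(n_features, 128, 7),
            conv_block(128, 256, 5),
            conv_block(256, 128, 3),
        )
        self.fc = nn.Linear(hidden + 128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 10 timesteps, 32 features)
        _, (h_n, _) = self.lstm(x)                  # final hidden state of the LSTM branch
        lstm_out = h_n[-1]                          # (batch, hidden)
        conv_out = self.fcn(x.transpose(1, 2))      # (batch, 128, 10)
        conv_out = conv_out.mean(dim=2)             # global average pooling over time
        logits = self.fc(torch.cat([lstm_out, conv_out], dim=1))
        return torch.softmax(logits, dim=1)         # class probabilities, as in Figure 5

model = LSTMFCN()
probs = model(torch.randn(4, 10, 32))               # -> (4, 3)
```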
Table 5 and Table 6 show the validation and testing results of the proposed method and the SMOTE oversampling technique in our study. The “LSTM (Original), LSTM-FCN (Original)” results refer to the original data without the moving average process, after designing the facial and body features. The input data structure consists of a divided 10-s video, with each part containing 30 frames and each frame containing 32-dimensional features. “LSTM (SMA), LSTM-FCN (SMA)” shows the results for input data processed by the Skipped Moving Average with a 30-frame moving average window and 10 timesteps. “LSTM (SMA+OS), LSTM-FCN (SMA+OS)” presents the outcomes for training data after applying the Skipped Moving Average and oversampling processes, with the moving average window set to five frames across 10 timesteps. To compare effectively with existing oversampling methods, we also applied the SMOTE technique to oversample the training data. After processing with the Skipped Moving Average, the data samples comprise 60 sequences, with 6 sequences per second. We extract one sequence from each second, forming a structure of 10 timesteps. Subsequently, the data labeled as low engagement were oversampled six times using the SMOTE technique to serve as training data for the LSTM and LSTM-FCN models. This ensures uniformity in the data structure between the two oversampling methods: Skipped Moving Average oversampling and SMOTE oversampling. “LSTM (SMOTE), LSTM-FCN (SMOTE)” presents the outcomes following the application of SMOTE oversampling.
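For the SMOTE baseline, a possible implementation with imbalanced-learn is sketched below; because SMOTE operates on 2-D feature matrices, the 10 × 32 sequences are flattened before resampling and reshaped afterwards. The variable names, class encoding, and dummy data are illustrative assumptions, not the exact pipeline used in the paper.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Dummy training data: (n_samples, 10 timesteps, 32 features) and integer labels,
# where 0 = low, 1 = high, 2 = very high engagement (encoding assumed here).
X_train = np.random.rand(500, 10, 32)
y_train = np.random.choice([0, 1, 2], size=500, p=[0.1, 0.5, 0.4])

n, t, d = X_train.shape
X_flat = X_train.reshape(n, t * d)  # SMOTE expects a 2-D feature matrix

# Oversample only the minority "low" class to roughly six times its original size.
target = {0: int((y_train == 0).sum()) * 6}
X_res, y_res = SMOTE(sampling_strategy=target, random_state=10).fit_resample(X_flat, y_train)

X_res = X_res.reshape(-1, t, d)  # restore the (timesteps, features) layout for the LSTM
print(X_res.shape, np.bincount(y_res))
```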
LSTM (Original) and LSTM-FCN (Original) process the time-series data without applying downsampling or Skipped Moving Average techniques. LSTM (SMA) and LSTM-FCN (SMA) involve Skipped Moving Average processing on the original data. In both cases, the input data were not oversampled. The outcomes clearly indicate a significant disparity between the low-engagement label and the other labels, suggesting that data imbalances affect the classification results. After implementing Skipped Moving Average and oversampling, the results for the low-engagement label in both the LSTM and LSTM-FCN models (LSTM (SMA+OS), LSTM-FCN (SMA+OS)) improved compared to the settings without oversampling. These results indicate that our proposed Skipped Moving Average oversampling method is effective in addressing class imbalances in time-series data.

5. Discussion

The above results demonstrate the effectiveness of our proposed Skipped Moving Average oversampling method for processing class-imbalanced video time-series data. SMOTE works by generating new instances from existing minority cases supplied as input [37]; it synthesizes new samples in the feature space around existing minority-class samples. Therefore, the data generated by SMOTE are not authentic, and related investigations have noted the disadvantage of blind neighbor selection [38]. This may destroy the characteristic continuity of video time-series data during generation. In contrast, our method employs real data for oversampling, ensuring that the resampled data retain the authenticity, inherent continuity, and variability of video time-series data. Consequently, this approach not only enhances accuracy but also demonstrates stable performance.
Due to the limitations of the existing public dataset, 10-s samples may be too short to accurately reflect the real levels and changes of student engagement in an online educational environment. Typically, changes in student engagement during online learning would not occur within such a brief time frame, which means that our current experiments still have constraints. Additionally, if the duration of the videos were altered, the length of the moving average window and the span of the timesteps would also need corresponding adjustments. Therefore, finding a setting for the Skipped Moving Average window that is more suitable for real-time online learning is also essential. This suggests that there is still significant potential to enhance the proposed Skipped Moving Average method.
Why, then, does our proposed method show feasibility while its accuracy is not as high as expected? Certainly, the lack of low-engagement training and testing samples is an unavoidable factor. Additionally, our experimental results show that the majority of misclassifications involved assigning samples to the high and very high engagement levels. Upon manual inspection of the misclassified sample videos, we found that distinguishing between these two labels is often difficult, even with the human eye. This aligns with findings from related literature reviews [2], where some studies increased accuracy by removing certain sample videos or frames; however, the outcomes of such methods may be overly optimistic. Therefore, accurately distinguishing ambiguous high and very high engagement levels in 10-s sample videos is challenging for both humans and deep-learning models.

6. Conclusions

In this study, we addressed the challenge of class imbalanced time-series video data in engagement estimation/detection by proposing a novel approach: Skipped Moving Average oversampling. This method is specifically tailored for processing video time-series data and improves the accuracy of analyzing engagement levels.
  • RQ1: How do we deal with class-imbalanced datasets such as DAiSEE?
First, we designed facial and body features for learners in online learning based on existing psychological research on the relationship between internal states and external expressions. We proposed a data processing method, the Skipped Moving Average oversampling method, to address the problem of class imbalance in video time-series data. After processing, the data were fed into LSTM and LSTM-FCN models to investigate the effectiveness of our proposed method. All processing was performed on the DAiSEE dataset. The experimental results indicated that our proposed method effectively mitigates the problem of class imbalance in time-series data.
  • RQ2: How does the proposed method affect the accuracy of engagement estimation?
To further validate the Skipped Moving Average oversampling method, we compared it with the widely used oversampling method SMOTE to assess their relative effectiveness in handling class imbalanced time-series data. As shown in Table 6, our proposed method outperforms the SMOTE oversampling method in metrics such as Recall, Precision, and F1 score. Thus, there is significant potential for these approaches in related research areas.
The above analysis provides new insights for our future research. The duration of the training and test data is an indispensable aspect of driving this research forward. Making full use of existing public datasets is also an important issue. Therefore, our next step will be to explore transfer learning, using existing datasets as source data to train a model and then validating it on a new dataset closer to the real online educational environment. Using different datasets will also better validate the optimal window frames for Skipped Moving Average oversampling.

Author Contributions

X.Z. and S.H. conceived the research; X.Z. was responsible for data gathering and statistical analysis. All authors contributed to the writing and revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Grants-in-Aid for Scientific Research (KAKENHI), Japan Society for the Promotion of Science under Grant 20H04294 and Grant 23H03505.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the DAiSEE dataset at “https://people.iith.ac.in/vineethnb/resources/daisee/index.html (accessed on 17 May 2024)”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kage, M. Theory of Motivation to Learn: Motivational Educational Psychology; Kaneko Bookstore: Tokyo, Japan, 2013. [Google Scholar]
  2. Dewan, M.A.A.; Murshed, M.; Lin, F. Engagement detection in online learning: A review. Smart Learn. Environ. 2019, 6, 1. [Google Scholar] [CrossRef]
  3. Hollister, B.; Nair, P.; Hill-Lindsay, S.; Chukoskie, L. Engagement in online learning: Student attitudes and behavior during COVID-19. Front. Educ. 2022, 7, 851019. [Google Scholar] [CrossRef]
  4. Martin, F.; Bolliger, D.U. Engagement matters: Student perceptions on the importance of engagement strategies in the online learning environment. Online Learn. 2018, 22, 205–222. [Google Scholar] [CrossRef]
  5. Nouri, J. The flipped classroom: For active, effective and increased learning–especially for low achievers. Int. J. Educ. Technol. High. Educ. 2016, 13, 1–10. [Google Scholar] [CrossRef]
  6. Bolliger, D.U. Key factors for determining student satisfaction in online courses. Int. J. E-Learn. 2004, 3, 61–67. [Google Scholar]
  7. Fredricks, J.A.; Blumenfeld, P.C.; Paris, A.H. School engagement: Potential of the concept, state of the evidence. Rev. Educ. Res. 2004, 74, 59–109. [Google Scholar] [CrossRef]
  8. Karimah, S.N.; Hasegawa, S. Automatic engagement estimation in smart education/learning settings: A systematic review of engagement definitions, datasets, and methods. Smart Learn. Environ. 2022, 9, 1–48. [Google Scholar] [CrossRef]
  9. Kaur, A.; Mustafa, A.; Mehta, L.; Dhall, A. Prediction and localization of student engagement in the wild. In Proceedings of the 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; pp. 1–8. [Google Scholar]
  10. Gupta, A.; D’Cunha, A.; Awasthi, K.; Balasubramanian, V. Daisee: Towards user engagement recognition in the wild. arXiv 2016, arXiv:1609.01885. [Google Scholar]
  11. Allen, R.L.; Davis, A.S. Hawthorne Effect. In Encyclopedia of Child Behavior and Development; Goldstein, S., Naglieri, J.A., Eds.; Springer: Boston, MA, USA, 2011. [Google Scholar] [CrossRef]
  12. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449. [Google Scholar] [CrossRef]
  13. Dresvyanskiy, D.; Minker, W.; Karpov, A. Deep learning based engagement recognition in highly imbalanced data. In Proceedings of the 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, 27–30 September 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 166–178. [Google Scholar]
  14. Anderson, A.R.; Christenson, S.L.; Sinclair, M.F.; Lehr, C.A. Check & Connect: The importance of relationships for promoting engagement with school. J. Sch. Psychol. 2004, 42, 95–113. [Google Scholar]
  15. Reschly, A.L.; Christenson, S.L. Handbook of Research on Student Engagement; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  16. Cocea, M.; Weibelzahl, S. Log file analysis for disengagement detection in e-Learning environments. User Model. User-Adapt. Interact. 2009, 19, 341–385. [Google Scholar] [CrossRef]
  17. Chaouachi, M.; Pierre, C.; Jraidi, I.; Frasson, C. Affect and mental engagement: Towards adaptability for intelligent. In Proceedings of the Twenty-Third International FLAIRS Conference, Daytona Beach, FL, USA, 19–21 May 2010. [Google Scholar]
  18. Fairclough, S.H.; Venables, L. Prediction of subjective states from psychophysiology: A multivariate approach. Biol. Psychol. 2006, 71, 100–110. [Google Scholar] [CrossRef] [PubMed]
  19. Goldberg, B.S.; Sottilare, R.A.; Brawner, K.W.; Holden, H.K. Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions. In Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction, ACII 2011, Memphis, TN, USA, 9–12 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 538–547. [Google Scholar]
  20. Zhang, Z.; Li, Z.; Liu, H.; Cao, T.; Liu, S. Data-driven online learning engagement detection via facial expression and mouse behavior recognition technology. J. Educ. Comput. Res. 2020, 58, 63–86. [Google Scholar] [CrossRef]
  21. James, W.T. A study of the expression of bodily posture. J. Gen. Psychol. 1932, 7, 405–437. [Google Scholar] [CrossRef]
  22. Kleinsmith, A.; Bianchi-Berthouze, N. Affective body expression perception and recognition: A survey. IEEE Trans. Affect. Comput. 2012, 4, 15–33. [Google Scholar] [CrossRef]
  23. Ekman, P.; Friesen, W.V. Measuring facial movement. Environ. Psychol. Nonverbal Behav. 1976, 1, 56–75. [Google Scholar] [CrossRef]
  24. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  25. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  26. Chang, C.; Zhang, C.; Chen, L.; Liu, Y. An ensemble model using face and body tracking for engagement detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder CO, USA, 16–20 October 2018; pp. 616–622. [Google Scholar]
  27. Grafsgaard, J.; Wiggins, J.B.; Boyer, K.E.; Wiebe, E.N.; Lester, J. Automatically recognizing facial expression: Predicting engagement and frustration. In Proceedings of the Educational Data Mining, Memphis, TN, USA, 6–8 July 2013. [Google Scholar]
  28. Seventh Emotion Recognition in the Wild Challenge (EmotiW). Available online: https://sites.google.com/view/emotiw2019/home?authuser=0 (accessed on 17 May 2024).
  29. DAiSEE Dataset for Affective States in E-Environments. Available online: https://people.iith.ac.in/vineethnb/resources/daisee/index.html (accessed on 17 May 2024).
  30. Villaroya, S.M.; Gamboa-Montero, J.J.; Bernardino, A.; Maroto-Gómez, M.; Castillo, J.C.; Salichs, M.Á. Real-time Engagement Detection from Facial Features. In Proceedings of the 2022 IEEE International Conference on Development and Learning (ICDL), London, UK, 12–15 September 2022; pp. 231–237. [Google Scholar]
  31. Zheng, X.; Hasegawa, S.; Tran, M.T.; Ota, K.; Unoki, T. Estimation of learners’ engagement using face and body features by transfer learning. In Proceedings of the International Conference on Human–Computer Interaction, Virtual, 24–29 July 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 541–552. [Google Scholar]
  32. Zheng, X.; Tran, M.T.; Ota, K.; Unoki, T.; Hasegawa, S. Engagement Estimation using Time-series Facial and Body Features in an Unstable Dataset. In Proceedings of the 30th International Conference on Computers in Education (ICCE 2022), Kuala Lumpur, Malaysia, 28 November–2 December 2022; pp. 89–94. [Google Scholar]
  33. Ai, X.; Sheng, V.S.; Li, C.; Cui, Z. Class-attention video transformer for engagement intensity prediction. arXiv 2022, arXiv:2208.07216. [Google Scholar]
  34. Jeni, L.A.; Cohn, J.F.; De La Torre, F. Facing imbalanced data–recommendations for the use of performance metrics. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 245–251. [Google Scholar]
  35. Hasegawa, S.; Hirako, A.; Zheng, X.; Karimah, S.N.; Ota, K.; Unoki, T. Learner’s mental state estimation with PC built-in camera. In Learning and Collaboration Technologies. Human and Technology Ecosystems: Proceedings of the 7th International Conference, LCT 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, 19–24 July 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 165–175. [Google Scholar]
  36. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 243–248. [Google Scholar]
  37. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  38. Jiang, Z.; Pan, T.; Zhang, C.; Yang, J. A new oversampling method based on the classification contribution degree. Symmetry 2021, 13, 194. [Google Scholar] [CrossRef]
  39. Yao, B.; Ota, K.; Kashihara, A.; Unoki, T.; Hasegawa, S. Development of a Learning Companion Robot with Adaptive Engagement Enhancement. In Proceedings of the 30th International Conference on Computers in Education (ICCE 2022), Asia-Pacific Society for Computers in Education, Kuala Lumpur, Malaysia, 28 November–2 December 2022; pp. 111–117. [Google Scholar]
  40. Dewan, M.A.A.; Lin, F.; Wen, D.; Murshed, M.; Uddin, Z. A deep learning approach to detecting engagement of online learners. In Proceedings of the 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Guangzhou, China, 8–12 October 2018; pp. 1895–1902. [Google Scholar]
  41. Murshed, M.; Dewan, M.A.A.; Lin, F.; Wen, D. Engagement detection in e-learning environments using convolutional neural networks. In Proceedings of the 2019 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Fukuoka, Japan, 5–8 August 2019; pp. 80–86. [Google Scholar]
  42. Bosch, N. Detecting student engagement: Human versus machine. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, Halifax, NS, Canada, 13–17 July 2016; pp. 317–320. [Google Scholar]
  43. Bond, M.; Buntins, K.; Bedenlier, S.; Zawacki-Richter, O.; Kerres, M. Mapping research in student engagement and educational technology in higher education: A systematic evidence map. Int. J. Educ. Technol. High. Educ. 2020, 17, 1–30. [Google Scholar] [CrossRef]
  44. Mehrabian, A.; Friar, J.T. Encoding of attitude by a seated communicator via posture and position cues. J. Consult. Clin. Psychol. 1969, 33, 330. [Google Scholar] [CrossRef]
  45. Dael, N.; Mortillaro, M.; Scherer, K.R. Emotion expression in body action and posture. Emotion 2012, 12, 1085. [Google Scholar] [CrossRef]
  46. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
  47. Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1145–1153. [Google Scholar]
  48. Ekman, P. Facial expressions. In Handbook of Cognition and Emotion; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 1999; Volume 16, p. e320. [Google Scholar]
  49. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  50. Karim, F.; Majumdar, S.; Darabi, H.; Chen, S. LSTM fully convolutional networks for time series classification. IEEE Access 2017, 6, 1662–1669. [Google Scholar] [CrossRef]
  51. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32. [Google Scholar]
  52. LSTM-FCN-Pytorch. Available online: https://github.com/roytalman/LSTM-FCN-Pytorch (accessed on 28 January 2024).
Figure 1. Distribution of engagement levels in the “in the wild” [28] and DAiSEE [29] datasets. (We have used the number of downloaded data as the basis for our study).
Figure 2. Application of the Skipped Moving Average method on a 300-frame video.
Figure 3. Oversampling the DAiSEE dataset into 6 segments.
Figure 4. Images of key points extracted using OpenPose [46].
Figure 5. The LSTM and LSTM-FCN architecture.
Table 1. Overview of “in the wild” [28] and DAiSEE [29] datasets.
Dataset | Subjects | Video Snippets | Snippet Length | Total Time
“in the wild” | 78 (male/female 53/25) | 197 | 5 min | 59,100 s
DAiSEE | 112 (male/female 80/32) | 9068 | 10 s | 90,680 s
Table 2. Preliminary experiments and reproduced results from [33].
Model | Dataset | Low (Recall/F1) | High (Recall/F1) | Very High (Recall/F1)
LSTM [31,32] | DAiSEE | 0.050/0.090 | 0.740/0.640 | 0.410/0.470
QRNN [31,32] | DAiSEE | 0.000/0.000 | 0.930/0.660 | 0.070/0.120
LSTM [33] | DAiSEE | 0.068/0.122 | 0.732/0.625 | 0.421/0.489
LSTM [33] | “in the wild” | 0.571/0.696 | 0.789/0.682 | 0.667/0.690
Table 3. Testing results for different Skipped Moving Average windows with 32-D features.
Model (SMA window) | Low (Recall/Precision/F1) | High (Recall/Precision/F1) | Very High (Recall/Precision/F1)
LSTM (3-frame average) | 0.295/0.079/0.125 | 0.490/0.508/0.499 | 0.381/0.507/0.435
LSTM (5-frame average) | 0.346/0.090/0.142 | 0.523/0.521/0.522 | 0.373/0.537/0.440
LSTM (6-frame average) | 0.192/0.066/0.098 | 0.544/0.501/0.521 | 0.354/0.500/0.414
Table 4. Engagement labels overview: original labels, data consolidation and oversampling results.
Affective State | Very Low/Low | High | Very High
Original Labels | 61/459 | 4477 | 4071
Relabel | 520 | 4477 | 4071
Oversample | 2764 | 4009 | 3286
Table 5. Validation results for the original data, SMOTE, and Skipped Moving Average with 32-D features.
Model | Low (Recall/Precision/F1) | High (Recall/Precision/F1) | Very High (Recall/Precision/F1)
LSTM (Original) | 0.069/0.114/0.086 | 0.587/0.558/0.572 | 0.482/0.491/0.486
LSTM-FCN (Original) | 0.049/0.211/0.079 | 0.654/0.562/0.605 | 0.500/0.553/0.525
LSTM (SMOTE) | 0.821/0.751/0.784 | 0.539/0.557/0.548 | 0.556/0.579/0.567
LSTM-FCN (SMOTE) | 0.792/0.690/0.738 | 0.538/0.530/0.534 | 0.515/0.598/0.554
LSTM (SMA) | 0.096/0.235/0.137 | 0.634/0.561/0.595 | 0.521/0.558/0.539
LSTM-FCN (SMA) | 0.036/0.348/0.065 | 0.694/0.547/0.612 | 0.502/0.590/0.543
LSTM (SMA+OS) | 0.806/0.702/0.751 | 0.474/0.525/0.498 | 0.539/0.544/0.541
LSTM-FCN (SMA+OS) | 0.637/0.623/0.630 | 0.527/0.498/0.512 | 0.510/0.557/0.533
Table 6. Testing results for the original data, SMOTE, and Skipped Moving Average with 32-D features.
Model | Low (Recall/Precision/F1) | High (Recall/Precision/F1) | Very High (Recall/Precision/F1)
LSTM (Original) | 0.069/0.114/0.086 | 0.587/0.558/0.572 | 0.482/0.491/0.486
LSTM-FCN (Original) | 0.014/0.111/0.025 | 0.694/0.518/0.594 | 0.300/0.434/0.355
LSTM (SMOTE) | 0.295/0.053/0.089 | 0.385/0.475/0.425 | 0.350/0.487/0.407
LSTM-FCN (SMOTE) | 0.179/0.037/0.061 | 0.579/0.503/0.539 | 0.241/0.558/0.336
LSTM (SMA) | 0.192/0.109/0.140 | 0.665/0.510/0.577 | 0.314/0.526/0.393
LSTM-FCN (SMA) | 0.038/0.071/0.050 | 0.728/0.526/0.611 | 0.355/0.553/0.433
LSTM (SMA+OS) | 0.346/0.090/0.142 | 0.523/0.521/0.522 | 0.373/0.537/0.440
LSTM-FCN (SMA+OS) | 0.269/0.063/0.103 | 0.561/0.512/0.535 | 0.312/0.562/0.401
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
