Article

Online Learning Engagement Recognition Using Bidirectional Long-Term Recurrent Convolutional Networks

1
Hubei Research Center for Educational Informationization, Central China Normal University, Wuhan 430079, China
2
Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
3
School of Educational Technology, Northwest Normal University, Lanzhou 730070, China
*
Author to whom correspondence should be addressed.
Sustainability 2023, 15(1), 198; https://doi.org/10.3390/su15010198
Submission received: 15 November 2022 / Revised: 15 December 2022 / Accepted: 16 December 2022 / Published: 22 December 2022
(This article belongs to the Special Issue Sustainable E-learning and Education with Intelligence)

Abstract

Background: Online learning has been widely adopted by educational institutions worldwide to provide students with ongoing education during the COVID-19 pandemic. However, during online learning, students may lose interest and become anxious, which affects learning performance and can lead to dropout. Measuring students’ engagement in online learning has therefore become imperative, but it remains challenging owing to the lack of effective recognition methods and publicly accessible datasets. Methods: This study gathered a large number of online learning videos of students at a normal university. Engagement cues were used to annotate the dataset, which was constructed with three levels of engagement: low engagement, engagement, and high engagement. We then introduced a bi-directional long-term recurrent convolutional network (BiLRCN) for online learning engagement recognition in video. Results: An online learning engagement dataset was constructed, and six methods were evaluated using precision and recall, with BiLRCN obtaining the best performance. Conclusions: Both category balance and category similarity of the data affect performance; it is more appropriate to treat learning engagement as a process-based evaluation; and learning engagement, which is associated with learning performance, can inform teachers’ intervention strategies from a variety of perspectives. Dataset construction and deep learning methods need further improvement, and learning data management also deserves attention.

1. Introduction

1.1. Research Background

Since the outbreak of COVID-19, online learning has garnered considerable attention from schools [1]. Online learning moves face-to-face classes online, allowing real-time interaction between instructors and students even when they are not in the same classroom. In addition to helping students learn from home, online technology increases the flexibility of learning and the use of subsidies [2]. Yet its widespread utilization has been accompanied by several problems. Online learning does not encourage meaningful relationships between teachers and students or among students themselves [2,3,4], which may explain why online education has a higher dropout rate than offline education [5]. In certain subjects, online learning can increase the anxiety of students who hold a negative view of their own abilities, which does not help them achieve better academic results [6,7]. These problems are very detrimental to the education and growth of students.
Teaching strategies can benefit students in online learning, and teachers can help students regain interest in studying in various ways, such as by offering instructional materials [8]. However, owing to the limitations of devices and networks, it is difficult for teachers to accurately assess each student’s performance in the online learning environment, which makes it hard for them to intervene effectively in the classroom to ensure the quality of student learning [9,10]. The large number of students in Chinese classrooms also makes it difficult for teachers to pay attention to each student. It is therefore important to help teachers obtain the status of their students’ online learning so that they can tailor their teaching strategies accordingly.
Monitoring the quality of students’ online learning to rescue those who are about to drop out will become a future entry point of online education. Student learning can be evaluated through learning engagement, which is directly related to learning performance [11]. The effective recognition of students’ online learning engagement has thus become an essential consideration for teachers intervening in student learning and improving teaching quality. Initially, manual methods were used to assess student engagement, but these are time-consuming and labor-intensive, and the results can be strongly influenced by subjectivity. Methods that assess student engagement through external observation also place high demands on the observer. With the development of information technology, automatic recognition methods based on learning data have received much attention from researchers. Compared with the other methods, automatic recognition is non-intrusive and does not interrupt the student learning process.
Currently, most automated methods for learning engagement recognition are based on deep learning models [7,12]. Deep learning is driven by data, but learning engagement data currently face several problems, including complicated data modalities, a shortage of open-access datasets, and uneven annotation standards, which directly limit the results of automatic learning engagement recognition. In addition, differences in learning performance across ethnic groups make it more difficult to systematically advance automatic recognition of learning engagement appropriate for China.
There are two main approaches to learning engagement recognition: using physiological signals (e.g., heart rate, brainwaves, skin conductance) and using behaviors (e.g., posture, gestures, facial expressions) [12]. However, collecting physiological signals in an online learning context requires wearable equipment, which is difficult to deploy. Instead, it is more feasible to use student behavior in learning videos recorded via webcam, because this method allows data to be collected without intruding on the student learning process.

1.2. Learning Engagement and Its Measurement Methods

Learning engagement often appears as the antithesis of learning burnout, a concept introduced in 1985 by Meier et al. [13]. They regard learning burnout as a state of physical and mental exhaustion originating from a vicious cycle between the learning environment and the learner, comprising three aspects: emotional exhaustion, behavioral misconduct, and low personal achievement. In 2004, Fredricks et al. [14] provided a widely accepted definition of learning engagement, asserting that it consists of a multidimensional structure of emotional, behavioral, and cognitive engagement.
The level of student engagement is directly correlated with the quality of learning [15]. Learning engagement refers to the learner’s positive and engaged state of mind in the learning situation and activity. Measurement of learning engagement dates back to 1980. The main measurement methods are self-feedback reports, external observation, and automatic recognition. Self-feedback reporting methods assess student engagement through self-reports or questionnaires; for example, Greene et al. [16] used a Likert scale to investigate student engagement. Although often convenient and useful, self-feedback reports depend on the learner’s understanding of learning engagement, their level of compliance, and their memory of the learning process. External observation is another important method for assessing learning engagement [17], but it requires a certain level of expertise from the observer, which makes it difficult to handle large amounts of data. Automatic recognition methods aim to evaluate student engagement using cutting-edge technologies such as machine learning and computer vision, which can successfully address the aforementioned shortcomings [18]. Although automatic recognition of learning engagement currently suffers from problems such as difficult data collection and annotation, low recognition performance, and low interpretability, we still believe it is promising for the future.

1.3. Video-Based Recognition of Learning Engagement

Compared with the constraints of self-feedback reporting and external observation (e.g., being time-consuming and labor-intensive, and unable to handle huge amounts of data), automated recognition methods perform better. Automated recognition collects many performance indicators from the student’s learning process and evaluates learning engagement from the gathered data without interfering with learning. Video data have become the primary modality in learning engagement recognition studies owing to their ease of collection and rich information content [19,20]. Other engagement recognition studies utilize additional modalities such as images [21], audio [22], and physiological data [23].
In this study, we concentrated on learning engagement recognition based on video data (see Table 1). Gupta et al. [24] proposed the DAiSEE dataset and used traditional Long-term Recurrent Convolutional Networks, C3D, and other networks for four-class learning engagement prediction. Zaletelj et al. [19] proposed a large-scale analysis mechanism for student classroom behavior data obtained by the Kinect One sensor, which can estimate students’ levels of attention and engagement in the classroom and give teachers instructional-evaluation feedback, based upon which teachers can tailor instruction to students to support learning performance. Huang et al. [25] proposed an engagement recognition network (DERN) based on temporal convolution, Bi-directional Long Short-Term Memory, and an attention mechanism for the DAiSEE dataset. Abedi et al. [26] used a hybrid end-to-end network of ResNet (Residual Network) and TCN (Temporal Convolutional Network) to analyze the original video sequences, and the results outperformed other approaches on the same dataset. Sümer et al. [20] collected facial video data from 128 students in grades 5–12 in a classroom and utilized three methods (SVM, MLP, and LSTM) to predict learning engagement on a scale of −2 to 2 (off-task to on-task), comparing the engagement levels of students in different grades. Liao et al. [27] extracted facial features from the DAiSEE dataset using a pre-trained SENet and then utilized an LSTM network with a global attention mechanism to predict learning engagement. Mehta et al. [28] proposed a three-dimensional DenseNet self-attention network, compared the results with current methods on two- and four-class metrics, and verified the network’s robustness on the EmotiW dataset.
Although learning engagement measurement had garnered attention before the pandemic, its widespread growth was nonetheless driven by the epidemic. Deep learning, a branch of machine learning that uses artificial neural networks to learn features from data, has emerged as a major feasible approach to automatic learning engagement recognition. The models chosen by current deep learning methods mostly focus on the temporal features of students’ behavioral performance, but this use is often simplistic. It has become a consensus among researchers that videos of students’ facial expressions are the most representative data for student learning engagement, yet there is still no standard paradigm for handling such data. In addition, existing methods are mostly evaluated by accuracy and mean square error, which is intuitive but lacks comprehensiveness.

1.4. Dataset for Engagement Recognition

Most studies based on open-access datasets measure learning engagement on HBCU [30], DAiSEE (Dataset for Affective States in E-Environments) [24], and the in-the-wild dataset [29]. The HBCU data were collected from thirty-four people from two distinct pools (nine men and thirty-five women); individuals in both pools participated in Cognitive Skills Training research coordinated by a Historically Black College/University (HBCU) and the University of California (UC). The DAiSEE dataset is a multi-label video classification dataset comprising 9068 video clips from 112 subjects, with labels for boredom, confusion, engagement, and frustration. Each label is represented by level 0 (very low), level 1 (low), level 2 (high), and level 3 (very high); for engagement, the respective counts are 61, 459, 4477, and 4071. The in-the-wild dataset includes 78 people and 195 videos (each lasting around 5 min), collected in unrestricted settings such as computer laboratories, dorm rooms, and open spaces; its labels are disengaged, barely engaged, normally engaged, and highly engaged. The labels in DAiSEE and in-the-wild were determined by crowdsourcing, whereas human experts performed the labeling in HBCU. Because differing labeling standards can produce unclear engagement labels, some studies exclude data with ambiguous labels, which improves accuracy but reduces the amount and variety of the data. In addition, there is no learning engagement dataset for Chinese students.
Additionally, data collection is more difficult because various learning environments, learning tasks, and student populations require different data-gathering and processing methods, so self-collected learning engagement data are typically small samples. In conclusion, the data used in current learning engagement research vary in unit duration, annotators, labeling criteria, and collection processes, making it difficult to develop learning engagement recognition systematically. Researchers may align data using a variety of processing techniques, but before undertaking a study, the annotation process must be discussed and established.

1.5. Problem Statement

Based on the above discussion, the main problems of learning engagement recognition are currently as follows:
I. How can a more realistic dataset of online learning engagement be constructed, given the lack of publicly available datasets?
II. How can the automatic recognition results of deep learning-based learning engagement be improved for practical applications?
III. How should learning engagement results help teachers develop teaching intervention strategies?

1.6. Contributions

This study built an online learning engagement dataset from videos of students recruited from a university in Wuhan, Hubei Province. Learning engagement cues were used to establish the three-category labels for this dataset. Based on this dataset, we investigated the automatic recognition of students’ learning engagement in online learning scenarios through the BiLRCN network. Finally, we analyzed and discussed the results, explored feasible methods for the automatic recognition of learning engagement, and proposed future research directions. The contributions of our work can be summarized as follows:
  • We created a dataset for the learning engagement of Chinese students that is more quantifiable, interpretable, and annotated by multiple engagement cues. The dataset consists of online learning videos of Chinese students, with a video duration of 10 s.
  • We introduced the Bi-directional Long-Term Recurrent Convolutional Neural Network (BiLRCN) framework for recognizing engagement from videos. This method focuses on the sequential features of learning engagement using the TimeDistributed layer, and its effectiveness has been verified on the self-built dataset.
The rest of this article is structured as follows: Section 2 describes the construction of the dataset, including the collection process and annotation criteria. Section 3 describes the bidirectional long-term recurrent convolutional network introduced in this study. Section 4 presents the experimental metrics and results and a comparison with five state-of-the-art methods. Section 5 discusses these results. Finally, conclusions and possible future research directions are given in Section 6.

2. Dataset Construction

2.1. Data Collection

An HD webcam (Logitech C930c, 1920 × 1080, 30 fps) mounted on a laptop computer was used to collect video data from students engaging in online learning. For this study, 58 undergraduate and graduate students between the ages of 21 and 25 were recruited (42 females and 16 males), spanning six majors including educational technology, computer science, and psychology. We used OBS (Open Broadcaster Software) to record students’ computer screens during the experiment to guarantee that they executed the assigned learning tasks. Additionally, we used a custom software program to record videos of the students’ faces and bodies, and these recordings served as the raw data. Apart from being required to sit in front of the computer, students were not constrained in any other manner.
Three online learning tasks were given to the participants in this experiment:
  • Watch a one-minute, thirty-second medical video, then answer one easy multiple-choice question about the content within the allotted four minutes;
  • Read documents related to machine learning material (accuracy and recall) and finish 9 challenging calculation questions with an 8-min time constraint based on the material;
  • Answer two multiple-choice questions based on the content after watching a 6-min English video on the development of facial recognition. The participants in this exercise have a 7-min time constraint.
Besides completing the learning tasks, participants were also required to rate the difficulty of the tasks and indicate whether they had previously been exposed to the tasks’ material. Before the experiment, subjects were informed of the experimental procedure and requirements. To reduce the Hawthorne effect, students were allowed breaks before the experiment began and between tasks. Five staff members directed the experiment but did not interfere with the participants’ operation. After the experiment, students were asked to use the recorded video to recall their engagement for each full minute. All subjects featured in the video signed an informed consent form before the experiment, and each subject provided only age, gender, and major as identifying information. The experimental setup is shown in Figure 1.
Excluding the videos lost or missed during the experiment, we gathered 1073 min of raw video data (saved in AVI containers with H.264 encoding). To align the data for subsequent annotation, we used the FFmpeg tool to crop the videos after the initial data screening, obtaining 6308 raw 10-s clips.
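The cropping step can be sketched as follows. This is a minimal illustration, assuming `ffmpeg` is available on the PATH; the file names and the `segment_commands` helper are hypothetical, not tooling from the paper:

```python
def segment_commands(src, total_seconds, clip_len=10, out_prefix="clip"):
    """Build one ffmpeg stream-copy command per full 10-s clip;
    any trailing partial clip is dropped."""
    cmds = []
    for i in range(total_seconds // clip_len):
        start = i * clip_len
        cmds.append([
            "ffmpeg", "-ss", str(start), "-i", src,
            "-t", str(clip_len), "-c", "copy",
            f"{out_prefix}_{i:04d}.avi",
        ])
    return cmds

# A 65-s recording yields six full 10-s clips; each command could then
# be executed with subprocess.run(cmd, check=True).
cmds = segment_commands("raw_subject042.avi", 65)
```

Stream copy (`-c copy`) avoids re-encoding the H.264 stream, so segmentation is fast and lossless.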

2.2. Data Annotation

Three annotators with experience in learning engagement annotation performed the annotation work in this study. Before starting, the annotators systematically studied the annotation categories and criteria and completed a reliability assessment. Following the trichotomous classification commonly used in contemporary learning engagement recognition research, the data labels were separated into three groups: low engagement, engagement, and high engagement (marked 0, 1, and 2 from low to high, respectively). Low engagement indicates that the student is not engaged in the learning content or clearly shows engagement in other, non-learning content; engagement indicates that the learner is involved in the learning process, but this involvement is limited and susceptible to interruption; high engagement entails complete involvement in the learning process and a high level of stability under disturbance. Annotators were asked to determine the level of learning engagement from the intensity and proportion of the learning engagement cues (see Table 2) observed in students’ videos.
Eye movement-related cues were used as the primary cue in labeling, followed by facial expressions, body movements, and pre- and post-temporal states, to determine the degree of engagement. Given the annotation’s subjective nature, the annotators also considered the student subjects’ self-feedback when labeling. Annotation quality was controlled through a majority voting (MV) strategy, with disputed clips re-annotated after collective discussion.
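The MV aggregation can be sketched as below; a minimal illustration assuming three integer labels per clip, with the tie case standing in for the collective-discussion step described above:

```python
from collections import Counter

def majority_vote(labels):
    """Return the majority label among the annotators,
    or None on a tie (tied clips go to collective discussion)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority; escalate
    return counts[0][0]

agreed = majority_vote([2, 2, 1])    # majority label 2
disputed = majority_vote([0, 1, 2])  # three-way tie, needs discussion
```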
To facilitate subsequent training, validation, and testing of the model, we shuffled the labeled video data and divided it into training, validation, and testing sets in a ratio of 6:2:2. The final distribution of the data is shown in Table 3. The data are indexed by unique subject numbers; taking subject 042 as an example, Figure 2 shows this subject’s partial performance at different engagement levels (10 chosen frames).
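The shuffle-and-split step can be sketched as follows; the seed and the `split_622` helper name are illustrative assumptions:

```python
import random

def split_622(items, seed=42):
    """Shuffle and split items into train/val/test at a 6:2:2 ratio;
    rounding remainders fall into the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.6)
    n_val = int(len(items) * 0.2)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Split the 6308 labeled 10-s clips.
train, val, test = split_622(range(6308))
```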

3. Online Learning Engagement Recognition Method

The Long-term Recurrent Convolutional Network (LRCN) [31] is a deep learning network that combines Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) networks. It can process temporal video or single-frame image inputs and produce single-value or temporal predictions, making it a versatile network for sequential inputs and outputs. LRCN has been widely used in activity recognition, image description, video description, and more. Owing to LRCN’s excellent performance, several improvements have been proposed; for example, Yan et al. [32] proposed a bidirectional LRCN for stress recognition, and their results also show that bidirectional LSTM is helpful for video classification.
Given that learning engagement is a process performance, this paper utilizes a Bidirectional Long-Term Recurrent Convolutional Network (BiLRCN) that combines a two-dimensional convolutional neural network (2DCNN) wrapped by a TimeDistributed layer with a BiLSTM for learning engagement recognition. Taking video 0191012 in the dataset as an example, the BiLRCN network structure used in this study is shown in Figure 3. The network consists of four main parts, from left to right: the video frame input layer, the spatio-temporal feature extraction layer, the time-series learning layer, and the determination layer. The model takes the video frame sequence as input and uses the TimeDistributed-wrapped 2DCNN to extract spatio-temporal features. The extracted features pass through a BiLSTM network for temporal feature learning to obtain the temporal output of the network, and the final output is obtained through a fully connected layer with softmax as the activation function.
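A minimal Keras sketch of this pipeline follows. The layer widths and LSTM size are illustrative assumptions, not the exact configuration reported in the paper; only the overall structure (TimeDistributed 2DCNN features, then BiLSTM, then a softmax determination layer) mirrors the description above:

```python
from tensorflow.keras import layers, models

def build_bilrcn(seq_len=40, height=80, width=80, n_classes=3):
    """BiLRCN sketch: per-frame 2D CNN features -> BiLSTM -> softmax."""
    return models.Sequential([
        layers.Input(shape=(seq_len, height, width, 3)),
        # Spatio-temporal feature extraction: the TimeDistributed wrapper
        # applies the same small-kernel (3x3) CNN to every frame.
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Flatten()),
        # Time-series learning over the 40-frame feature sequence.
        layers.Bidirectional(layers.LSTM(64)),
        # Determination layer: probabilities over the three engagement levels.
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_bilrcn()
```

The model would then be compiled with a categorical cross-entropy loss and trained on the labeled 10-s clips.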

3.1. Features Extraction

In a 2DCNN, the convolution kernel performs a sliding-window operation over the two-dimensional space of the input image, preserving the spatial features of individual video frames. However, a 2DCNN can only receive one frame per convolution. While this can help identify students in a single image, recognizing varying degrees of engagement requires a sequence of frames. Training a separate convolutional stream for each image would require a great deal of computation, so for sequential video frames we use the TimeDistributed wrapper. It applies the wrapped CNN layers to each time slice of the input, allowing the spatial features extracted by the convolutional network to preserve the temporal structure. Inspired by VGG16, we extract features with alternating convolution and pooling, which allows for more nonlinear variation in the data. We employ small convolutional kernels (3 × 3) in the CNN section of the model, which deepens the network but reduces parameters and improves generalization.

3.2. Sequence Learning

Although BiLSTM is mostly used in NLP domains such as sentence classification, learning engagement is considered continuous, so we also used BiLSTM for engagement classification. The BiLSTM network can consider the contextual information of learning engagement because it adds an inverse pass to the traditional LSTM, enabling the network to make assessments that depend on students’ pre- and post-learning states. BiLSTM prevents real-time classification of the video, but it better matches our annotation scheme than LSTM, i.e., using one label to represent an entire 10-s video.

4. Experiment

4.1. Experimental Setting

The experimental environment for this study was configured with an NVIDIA GeForce RTX 3070 8 GB GPU, an Intel i7-11700 CPU, and Windows 10, with Keras as the deep learning framework. The input dimension was 40 × 80 × 80 × 3 (NHWC), where 40 is the length of the input video frame sequence, 80 × 80 is the image resolution, and 3 is the number of RGB color channels; the output was three-dimensional, representing the probabilities of the three engagement levels, with the dimension containing the maximum value taken as the final prediction.
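The paper does not state how the 40 input frames are drawn from each 10-s clip; uniform sampling across the clip is one common choice, sketched here as an assumption:

```python
def uniform_frame_indices(n_frames, n_samples=40):
    """Pick n_samples frame indices spread evenly across a clip."""
    step = n_frames / n_samples
    return [int(i * step) for i in range(n_samples)]

# A 10-s clip at 30 fps has 300 frames; keep 40 of them,
# then resize each selected frame to 80 x 80 before stacking.
idx = uniform_frame_indices(300)
```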

4.2. Evaluation Metrics

The precision (P) and recall (R) metrics were used to measure model performance in this experiment, and they were computed as given in Equations (1) and (2), respectively.
P = TP / (TP + FP)    (1)
R = TP / (TP + FN)    (2)
where TP (True Positives) indicates the number of properly predicted target engagements, FP (False Positives) represents the number of mistakenly predicted target engagements, and FN (False Negatives) represents the number of target engagements that were not successfully identified.
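Equations (1) and (2) can be computed per class directly from the label lists. A small sketch with toy labels (0 = low engagement, 1 = engagement, 2 = high engagement); the helper name and example labels are illustrative:

```python
def precision_recall(y_true, y_pred, target):
    """Per-class precision and recall, treating `target` as the positive class."""
    tp = sum(t == target and p == target for t, p in zip(y_true, y_pred))
    fp = sum(t != target and p == target for t, p in zip(y_true, y_pred))
    fn = sum(t == target and p != target for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One high-engagement clip is missed (predicted as engagement):
# precision for class 2 stays 1.0 while recall drops to 2/3.
p, r = precision_recall([2, 2, 1, 0, 2], [2, 1, 1, 0, 2], target=2)
```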

4.3. Experimental Results

In the same experimental setting, this study compared the performance of different methods on our dataset, and the results are shown in Table 4 and Figure 4. In Figure 4, the horizontal rows show the real category of the video, the vertical columns show the model’s predicted category, and the brackets represent the recall R of the current category.
Compared with other methods, the accuracy of BiLRCN and LRCN is higher, and the precision and recall of BiLRCN in the different categories are higher than those of LRCN, demonstrating that treating learning engagement as a process and considering its temporal features can effectively improve accuracy. However, there is still room for improvement, which could be due to the following factors: first, manual features were not extracted before the experiment, which increases the noise in the training data; second, the videos in the dataset are all of adult learners, whose behavior during learning may be more implicit, which also raises the difficulty of judgment.
When comparing the results across engagement levels, predictions skew toward high engagement because the uneven distribution of the dataset makes the model’s learning of the different levels imbalanced as well. The precision and recall for high engagement are generally and significantly higher than those for low engagement and engagement. The high precision means the model judges high engagement more accurately, because students’ continuous performance in the high engagement state varies less than in the low engagement and engagement states. Although high recall might lead to false detections, it also implies that the model will detect every conceivable instance of high engagement, making the distinction between high learning engagement (high engagement) and lower learning engagement (low engagement, engagement) more obvious. In practice, teachers’ instructional interventions are primarily addressed to students with low learning engagement [36], making recall a more appropriate metric for evaluating learning engagement measures. Furthermore, most false detections of engagement are toward high rather than low levels, a consequence of judging students’ internal cognitive processes solely through external observation of video data.

5. Discussion

This section explains and discusses the experimental results. We found that the accuracy of automatic video-based learning engagement recognition does not match that of classification in other domains. Although the experimental results have been analyzed above, a few points still merit discussion in comparison with other studies.

5.1. Discussion of Experimental Results

Sample imbalance produces unbalanced results. Regardless of the method used, the results show that categories with more training samples perform better in the testing stage. While we believe that the sample imbalance in the self-built dataset is realistic, it also means that misclassifying uncommon samples does not significantly impact overall precision. A general phenomenon in the experimental results is therefore that categories with a higher number of training samples perform better in recognition. Improving precision from the algorithmic perspective thus requires focusing on categories with abundant data, but enhancing numerical performance is not always proper from an educational standpoint. For example, Abedi et al. [26] studied learning engagement recognition methods based on DAiSEE; although their method showed a good improvement in accuracy, it did not work well for recognizing the rare labels. The recall metrics indicate that high engagement is detected more efficiently and accurately in this experiment and that the meaning of recall is more consistent with actual instructional needs, which supports the validity of using recall to evaluate our results.
Samples of similar categories are prone to be misclassified. Our dataset distinguishes levels of learning engagement based on the proportion and timing of engagement cues, which annotators understand but machines do not. The class boundaries in the dataset are therefore not completely discrete, especially between engagement and high engagement, making it difficult for deep learning models to distinguish them. Therefore, misdetections of 0 (low engagement) and 2 (high engagement) are more likely than of 1 (engagement). Although Bergdahl et al. [37] concluded that engagement lacks intrinsic boundaries, it remains difficult to judge whether learning engagement should be treated as a discrete or a continuous value. The advantage of labeling learning videos with engagement cues is that learning engagement is treated as a discrete value while maintaining continuity, which facilitates subsequent extension studies. To date, there is no convincing study treating learning engagement as continuous values with labeled annotations and open-access datasets.
Learning engagement is best treated as a process-based, comprehensive assessment. Methods that consider temporal features outperform the others, and our method, which focuses on temporal contextual features, produces the best results. Our annotators likewise judged engagement from continuous behavior. In this study, engagement was reported in ten-second intervals, although there is no accepted standard for the most accurate interval length. Video-based engagement recognition models are mostly end-to-end, which reduces labor costs but exposes the model to more noise during training. Although differences in the sequential performance of the three engagement levels were noted in earlier sections, they remain small compared to other study domains. Since learning engagement spans three dimensions (behavioral, affective, and cognitive), more fine-grained features should be considered when observable cues from students are not evident during online learning, for example by extracting more handcrafted features from the video or by integrating other modal data.
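As a sketch of the interval-wise, process-based view described above, the following hypothetical helper splits a per-frame label stream into fixed ten-second clips and assigns each clip its majority label. The frame rate, window length, and the discarded tail (the "wasted data" mentioned in Section 5.3) are illustrative assumptions, not our exact preprocessing.

```python
from collections import Counter

def clip_labels(frame_labels, fps=30, clip_seconds=10):
    """Window a per-frame engagement label stream into fixed-length
    clips and give each clip its majority label; a trailing partial
    clip is discarded."""
    clip_len = fps * clip_seconds
    clips = []
    for start in range(0, len(frame_labels) - clip_len + 1, clip_len):
        window = frame_labels[start:start + clip_len]
        clips.append(Counter(window).most_common(1)[0][0])
    return clips

# 25 seconds of frames at 30 fps -> two 10-second clips, 5 s discarded.
labels = [1] * 300 + [2] * 450
print(clip_labels(labels))  # [1, 2]
```

Choosing the window length trades temporal resolution against annotation noise, which is exactly the open question about the "most accurate" interval.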

5.2. Discussion of Learning Engagement Application

Effective and diversified application of learning engagement is required. When teaching online, a teacher's attention cannot cover every student's recognized engagement, so individual students' engagement results must be integrated or aggregated; applying engagement efficiently in diverse ways is also a future research direction for us. At the individual level, the evolution of a student's engagement indicates how engaged they are throughout the class, and teachers can be alerted to chronically low-engagement students. At the class level, we can aggregate the overall engagement of all students and promptly notify the teacher when class-wide engagement is generally low. The teacher can then increase interest by selecting simple, attractive learning media or preparing short, clear, easy-to-understand learning materials. In addition, teachers should regularly assess students' status and intervene with students who have lost interest in learning for a long time.
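The individual-level and class-level alerts described above could be implemented along these lines; the function name and both thresholds are hypothetical choices for illustration, not part of our system.

```python
def low_engagement_alerts(class_labels, student_threshold=0.5,
                          class_threshold=1.5):
    """class_labels maps student_id -> list of 0/1/2 engagement labels
    over time. Flag students whose share of low-engagement (0) intervals
    exceeds student_threshold, and raise a class-wide alert when the
    mean label across all students falls below class_threshold."""
    flagged = [sid for sid, seq in class_labels.items()
               if seq and seq.count(0) / len(seq) > student_threshold]
    all_labels = [x for seq in class_labels.values() for x in seq]
    class_mean = sum(all_labels) / len(all_labels) if all_labels else 0.0
    return flagged, class_mean < class_threshold

# Student s1 is mostly disengaged, so both alert levels fire.
flagged, class_alert = low_engagement_alerts(
    {"s1": [0, 0, 0, 1], "s2": [2, 2, 2, 2]})
```

In practice the thresholds would be tuned per course, since baseline engagement differs across tasks and cohorts.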
The relationship between learning engagement and learning performance is worth exploring. Before the study, we expected that a student whose engagement was consistently high and stable would probably perform better. Taking 0082 (student 008 on the second task) and 0032 as examples: 0082's engagement was more erratic, mostly low engagement (0) and engagement (1), whereas 0032 was mostly high engagement (2); as a result, 0082 did not complete the questions, while 0032 answered eight of nine questions correctly. Moreover, most low-performing students had low or fluctuating engagement. Some data show the opposite result, and we believe that judging students' internal mental activity solely from external performance is incomplete, given the cognitive dimension of learning engagement.
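One simple way to probe the suspected relationship is a Pearson correlation between each student's mean engagement label and their number of correct answers. The numbers below are invented for illustration and are not our experimental data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-student values: mean engagement label (0-2 scale)
# versus questions answered correctly out of nine.
mean_engagement = [0.6, 1.1, 1.4, 1.9]
score = [2, 4, 6, 8]
r = pearson(mean_engagement, score)   # strongly positive here
```

A high correlation on such aggregates would still say nothing about the cognitive dimension, which is exactly the incompleteness noted above.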

5.3. Discussion of Future Development

We envision recognizing learning engagement without recording raw data; applying this in practice requires more in-depth research on the automatic recognition of learning engagement.
More comprehensive datasets rely on more accurate automatic identification methods. Comprehensiveness here refers not only to data types but also to research subjects. The adult students in this study showed little overall change in performance, which made accurate labeling more difficult. Engagement recognition research should therefore be expanded to explore the engagement characteristics of students at different levels and to create a comprehensive dataset with more explanatory labeling criteria. We believe performance can be optimized in two parts: data feature extraction and deep learning network construction. We have shown that considering the temporal features of learning engagement improves recognition accuracy, but we also found that some data were wasted. The granularity of feature extraction and the complexity of deep learning networks are the future directions of video-based automatic engagement recognition.
We must acknowledge some risks in recognizing learning engagement. The security and ethics of educational data are of great importance, requiring researchers to keep data confidential throughout the process and stronger legislative efforts at the policy level. Data security risk arises mainly during upload and storage. We therefore envisage that, in the upload phase, students run the recognition program locally on their own computers and upload only the recognition results; in the storage phase, data should be encrypted and stored in a private cloud, which can be built behind a firewall and thus reduces the risk. Ethical issues with video data mainly concern collection, representation, storage, and analysis [38]. The authenticity and objectivity of collection and representation deserve attention: we constructed a realistic experimental environment reflecting the characteristics of Chinese online learning and referred to students' self-reports for labeling, which to a certain extent avoids ethical problems. Storage and analysis mainly raise security issues, which were discussed previously.
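The upload-only-results scheme we envisage could look like the following sketch, which packages local recognition results (never raw video) with an HMAC-SHA256 integrity tag. A real deployment would additionally encrypt the payload with an authenticated cipher; the function names, payload fields, and key here are illustrative assumptions.

```python
import hashlib
import hmac
import json

def build_upload_payload(student_id, clip_results, shared_key: bytes):
    """Serialize only the recognition results and attach an HMAC tag
    so the server can verify the payload was not tampered with."""
    body = json.dumps({"student": student_id, "labels": clip_results},
                      sort_keys=True).encode()
    tag = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    return {"body": body.decode(), "tag": tag}

def verify_payload(payload, shared_key: bytes) -> bool:
    """Recompute the tag server-side; compare_digest avoids timing leaks."""
    expected = hmac.new(shared_key, payload["body"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, payload["tag"])
```

Keeping the raw video on the student's machine shrinks the attack surface to a few bytes of labels per clip, which is far easier to protect than hours of face video.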

6. Conclusions

This paper addresses teachers' difficulty in perceiving students' online learning engagement in a timely and accurate way. We collected many online learning videos, constructed an online learning engagement dataset, and introduced a deep learning-based engagement recognition method, comparing the performance of different methods on the dataset. Finally, we discussed the experimental results, applications of learning engagement, and future developments. This study can provide teachers with reliable assistance in evaluating student engagement and conducting learning interventions.
However, the present study has some limitations, and we will continue to improve automatic engagement recognition in the future. First, we will gather data from various stages, situations, and engagement categories, carry out annotation using more interpretable standards, and build a comprehensive learning engagement dataset. Second, we will keep refining the model to enhance precision, for example by handling data imbalance and extracting finer-grained features. Finally, we will use multimodal data (such as physiological signals) for engagement recognition.

Author Contributions

Conceptualization, Y.M., Y.S. and Y.W.; methodology, Y.W. and Y.M.; software, Y.M.; validation, Y.M., X.L. and Y.W.; formal analysis, Y.M., Y.T., Z.Z. and Y.W.; resources, Y.M., Y.S. and Y.W.; data curation, Y.M., Y.T. and Z.Z.; writing—original draft preparation, Y.M.; writing—review and editing, Y.W., Y.S. and X.L.; visualization, Y.M. and Y.S.; supervision, X.L. and Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62277029, the National Collaborative Innovation Experimental Base Construction Project for Teacher Development of Central China Normal University under Grant CCNUTEIII-2021-19, the Humanities and Social Sciences of China MOE under Grants 20YJC880100 and 22YJC880061, the Fundamental Research Funds for the Central Universities under Grant CCNU22JC011, and Knowledge Innovation Project of Wuhan under Grant 2022010801010274.

Institutional Review Board Statement

This research study was conducted in accordance with the ethical standards of the Helsinki Declaration. The Central China Normal University Institutional Review Board (CCNU IRB) usually exempts educational research from the requirement of ethical approval.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ladino Nocua, A.C.; Cruz Gonzalez, J.P.; Castiblanco Jimenez, I.A.; Gomez Acevedo, J.S.; Marcolin, F.; Vezzetti, E. Assessment of Cognitive Student Engagement Using Heart Rate Data in Distance Learning during COVID-19. Educ. Sci. 2021, 11, 540. [Google Scholar] [CrossRef]
  2. Pirrone, C.; Varrasi, S.; Platania, G.; Castellano, S. Face-to-Face and Online Learning: The Role of Technology in Students’ Metacognition. CEUR Workshop Proc. 2021, 2817, 1–10. [Google Scholar]
  3. Mubarak, A.A.; Cao, H.; Zhang, W. Prediction of students’ early dropout based on their interaction logs in online learning environment. Interact. Learn. Environ. 2020, 30, 1414–1433. [Google Scholar] [CrossRef]
  4. Wang, K.; Zhang, L.; Ye, L. A nationwide survey of online teaching strategies in dental education in China. J. Dent. Educ. 2021, 85, 128–134. [Google Scholar] [CrossRef] [PubMed]
  5. Fei, M.; Yeung, D.Y. Temporal Models for Predicting Student Dropout in Massive Open Online Courses. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015; pp. 256–263. [Google Scholar] [CrossRef]
  6. Pirrone, C.; Di Corrado, D.; Privitera, A.; Castellano, S.; Varrasi, S. Students’ Mathematics Anxiety at Distance and In-Person Learning Conditions during COVID-19 Pandemic: Are There Any Differences? An Exploratory Study. Educ. Sci. 2022, 12, 379. [Google Scholar] [CrossRef]
  7. Liu, S.; Liu, S.; Liu, Z.; Peng, X.; Yang, Z. Automated detection of emotional and cognitive engagement in MOOC discussions to predict learning achievement. Comput. Educ. 2022, 181, 104461. [Google Scholar] [CrossRef]
  8. Sutarto, S.; Sari, D.; Fathurrochman, I. Teacher strategies in online learning to increase students’ interest in learning during COVID-19 pandemic. J. Konseling Dan Pendidik. 2020, 8, 129. [Google Scholar] [CrossRef]
  9. Hoofman, J.; Secord, E. The Effect of COVID-19 on Education. Pediatr. Clin. N. Am. 2021, 68, 1071–1079. [Google Scholar] [CrossRef]
  10. El-Sayad, G.; Md Saad, N.H.; Thurasamy, R. How higher education students in Egypt perceived online learning engagement and satisfaction during the COVID-19 pandemic. J. Comput. Educ. 2021, 8, 527–550. [Google Scholar] [CrossRef]
  11. You, W. Research on the Relationship between Learning Engagement and Learning Completion of Online Learning Students. Int. J. Emerg. Technol. Learn. (iJET) 2022, 17, 102–117. [Google Scholar] [CrossRef]
  12. Shen, J.; Yang, H.; Li, J.; Cheng, Z. Assessing learning engagement based on facial expression recognition in MOOC’s scenario. Multimed. Syst. 2022, 28, 469–478. [Google Scholar] [CrossRef] [PubMed]
  13. Meier, S.T.; Schmeck, R.R. The Burned-Out College Student: A Descriptive Profile. J. Coll. Stud. Pers. 1985, 26, 63–69. [Google Scholar]
  14. Fredricks, J.A.; Blumenfeld, P.C.; Paris, A.H. School Engagement: Potential of the Concept, State of the Evidence. Rev. Educ. Res. 2004, 74, 59–109. [Google Scholar] [CrossRef] [Green Version]
  15. Lei, H.; Cui, Y.; Zhou, W. Relationships between student engagement and academic achievement: A meta-analysis. Soc. Behav. Personal. Int. J. 2018, 46, 517–528. [Google Scholar] [CrossRef]
  16. Greene, B.A. Measuring Cognitive Engagement With Self-Report Scales: Reflections From Over 20 Years of Research. Educ. Psychol. 2015, 50, 14–30. [Google Scholar] [CrossRef]
  17. Dewan, M.; Murshed, M.; Lin, F. Engagement detection in online learning: A review. Smart Learn. Environ. 2019, 6, 1. [Google Scholar] [CrossRef]
  18. Hu, M.; Li, H. Student Engagement in Online Learning: A Review. In Proceedings of the 2017 International Symposium on Educational Technology (ISET), Hong Kong, China, 27–29 June 2017; pp. 39–43. [Google Scholar] [CrossRef]
  19. Zaletelj, J.; Košir, A. Predicting students’ attention in the classroom from Kinect facial and body features. EURASIP J. Image Video Process. 2017, 2017, 80. [Google Scholar] [CrossRef] [Green Version]
  20. Sümer, Ö.; Goldberg, P.; D’Mello, S.; Gerjets, P.; Trautwein, U.; Kasneci, E. Multimodal Engagement Analysis from Facial Videos in the Classroom. IEEE Trans. Affect. Comput. 2021. [Google Scholar] [CrossRef]
  21. Zhang, Z.; Li, Z.; Liu, H.; Cao, T.; Liu, S. Data-driven Online Learning Engagement Detection via Facial Expression and Mouse Behavior Recognition Technology. J. Educ. Comput. Res. 2020, 58, 63–86. [Google Scholar] [CrossRef]
  22. Standen, P.J.; Brown, D.J.; Taheri, M.; Galvez Trigo, M.J.; Boulton, H.; Burton, A.; Hallewell, M.J.; Lathe, J.G.; Shopland, N.; Blanco Gonzalez, M.A.; et al. An evaluation of an adaptive learning system based on multimodal affect recognition for learners with intellectual disabilities. Br. J. Educ. Technol. 2020, 51, 1748–1765. [Google Scholar] [CrossRef]
  23. Apicella, A.; Arpaia, P.; Frosolone, M.; Improta, G.; Moccaldi, N.; Pollastro, A. EEG-based measurement system for monitoring student engagement in learning 4.0. Sci. Rep. 2022, 12, 5857. [Google Scholar] [CrossRef] [PubMed]
  24. Gupta, A.; D’Cunha, A.; Awasthi, K.; Balasubramanian, V. DAiSEE: Towards User Engagement Recognition in the Wild. arXiv 2016, arXiv:1609.01885. [Google Scholar]
  25. Huang, T.; Mei, Y.; Zhang, H.; Liu, S.; Yang, H. Fine-grained Engagement Recognition in Online Learning Environment. In Proceedings of the 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, 12–14 July 2019; pp. 338–341. [Google Scholar] [CrossRef]
  26. Abedi, A.; Khan, S.S. Improving state-of-the-art in Detecting Student Engagement with Resnet and TCN Hybrid Network. arXiv 2021, arXiv:2104.10122. [Google Scholar]
  27. Liao, J.; Liang, Y.; Pan, J. Deep facial spatiotemporal network for engagement prediction in online learning. Appl. Intell. 2021, 51, 6609–6621. [Google Scholar] [CrossRef]
  28. Mehta, N.K.; Prasad, S.S.; Saurav, S.; Saini, R.; Singh, S. Three-Dimensional DenseNet Self-Attention Neural Network for Automatic Detection of Student’s Engagement. Appl. Intell. 2022, 52, 13803–13823. [Google Scholar] [CrossRef] [PubMed]
  29. Kaur, A.; Mustafa, A.; Mehta, L.; Dhall, A. Prediction and Localization of Student Engagement in the Wild. In Proceedings of the 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; pp. 1–8. [Google Scholar] [CrossRef] [Green Version]
  30. Whitehill, J.; Serpell, Z.; Lin, Y.C.; Foster, A.; Movellan, J.R. The Faces of Engagement: Automatic Recognition of Student Engagementfrom Facial Expressions. IEEE Trans. Affect. Comput. 2014, 5, 86–98. [Google Scholar] [CrossRef]
  31. Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv 2014, arXiv:1411.4389. [Google Scholar]
  32. Yan, S.; Adhikary, A. Stress Recognition in Thermal Videos Using Bi-Directional Long-Term Recurrent Convolutional Neural Networks. In Neural Information Processing: Proceedings of the 28th International Conference ICONIP 2021, Sanur, Bali, Indonesia, 8–12 December 2021; Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 491–501. [Google Scholar]
  33. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv 2014, arXiv:1412.0767. [Google Scholar]
  34. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 1800–1807. [Google Scholar] [CrossRef] [Green Version]
  35. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. arXiv 2018, arXiv:1812.03982. [Google Scholar]
  36. Parsons, J.; Taylor, L. Improving Student Engagement. Curr. Issues Educ. 2011, 14, 132. [Google Scholar]
  37. Bergdahl, N. Engagement and disengagement in online learning. Comput. Educ. 2022, 188, 104561. [Google Scholar] [CrossRef]
  38. Peters, M.; White, E.; Besley, T.; Locke, K.; Redder, B.; Novak, R.; Gibbons, A.; O’Neill, J.; Tesar, M.; Sturm, S. Video ethics in educational research involving children: Literature review and critical discussion. Educ. Philos. Theory 2020, 53, 1–9. [Google Scholar] [CrossRef]
Figure 1. Data Collection. (a) Experimental procedure; (b) Experimental example.
Figure 2. Example of online learning engagement dataset. (a) disengagement; (b) engagement; (c) high engagement.
Figure 3. Architecture for BiLRCN.
Figure 4. Confusion matrix for experimental results, (a) BiLRCN, (b) LRCN, (c) ResTCN, (d) C3D, (e) Xception, (f) SlowFast.
Table 1. Automatic recognition of learning engagement.
| Research | Year | Data | Method | Setting | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Gupta et al. [24] | 2016 | DAiSEE | LRCN | Online learning | 57.9% |
| Zaletelj et al. [19] | 2017 | posture, expression | DT, KNN | watch lectures | - |
| Kaur et al. [29] | 2018 | in-the-wild | LSTM | watch videos | - |
| Huang et al. [25] | 2019 | DAiSEE | DERN | Online learning | 60.0% |
| Abedi et al. [26] | 2021 | DAiSEE | ResTCN | Online learning | 63.9% |
| Sümer et al. [20] | 2021 | posture, expression | SVM, DNN | Traditional classroom | - |
| Liao et al. [27] | 2021 | DAiSEE | DFSTN | Online learning | 58.84% |
| Mehta et al. [28] | 2022 | DAiSEE, EmotiW | DenseNet | Online learning | 63.59% |
Table 2. Engagement cues used to annotate the online learning engagement dataset.

| Cue | Student's Performance |
| --- | --- |
| Low Engagement | The vision drifts or leaves the computer. The gaze is dull, the expression is sleepy, blinks more often, or appears to raise and lower the head and turn the head. |
| Engagement | The line of sight is basically on the screen, the eye movement is small, the line of sight jumps out of the screen and flashes back quickly, the number of blinks is high, the expression is normal, the head and body posture is more upright, and the line of sight returns to the screen quickly after the presence of a clear head-down keyboarding action. |
| High Engagement | The head posture remains stable, the eyes are focused on the screen, the eyes are wide open, the eyes stare at the screen for a long time, or the eyes swing regularly. The expression is more serious, there is a tight frown or pursed mouth movement, and the body leans forward significantly. |
Table 3. Distribution of online learning engagement dataset.

| Label | Train | Valid | Test | Sum |
| --- | --- | --- | --- | --- |
| 0 | 766 | 242 | 208 | 1216 |
| 1 | 1162 | 403 | 304 | 1869 |
| 2 | 1982 | 650 | 510 | 3142 |
| Sum | 3910 | 1295 | 1022 | 6227 |
Table 4. Classification results obtained by BiLRCN, LRCN, ResTCN, C3D, Xception, and SlowFast. The best results are expressed in bold. 0: Low engagement; 1: Engagement; 2: High engagement.

| Method | P overall | P 0 | P 1 | P 2 | R 0 | R 1 | R 2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BiLRCN (ours) | 66.24% | 46.63% | 55.32% | 73.12% | 61.39% | 52.96% | 82.16% |
| LRCN [31] | 63.01% | 57.85% | 53.56% | 68.65% | 33.65% | 51.97% | 81.57% |
| ResTCN [26] | 61.65% | 54.55% | 51.46% | 67.09% | 37.50% | 41.45% | 83.53% |
| C3D [33] | 56.46% | 62.50% | 55.85% | 56.40% | 7.21% | 25.33% | 95.10% |
| Xception [34] | 62.04% | 46.93% | 52.65% | 71.83% | 51.44% | 39.14% | 80.00% |
| SlowFast [35] | 61.94% | 47.56% | 61.09% | 67.05% | 37.50% | 50.00% | 79.02% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, Y.; Wei, Y.; Shi, Y.; Li, X.; Tian, Y.; Zhao, Z. Online Learning Engagement Recognition Using Bidirectional Long-Term Recurrent Convolutional Networks. Sustainability 2023, 15, 198. https://doi.org/10.3390/su15010198