Research on Students’ Action Behavior Recognition Method Based on Classroom Time-Series Images

: Students’ action behavior performance is an important part of classroom teaching evaluation. To detect the action behavior of students in classroom teaching videos, and based on the detection results, the action behavior sequence of individual students in the teaching time of knowledge points is obtained and analyzed. This paper proposes a method for recognizing students’ action behaviors based on classroom time-series images. First, we propose an improved Asynchronous Interaction Aggregation (AIA) network for student action behavior detection. By adding a Multi-scale Temporal Attention (MsTA) module and a Multi-scale Channel Spatial Attention (MsCSA) module to the fast pathway and slow pathway, respectively, the accuracy of student action behavior recognition is improved in SlowFast, which is the backbone network of the improved AIA network,. Second, the Equalized Focal Loss function is introduced to improve the category imbalance that exists in the student action behavior dataset. Experimental results on the student action behavior dataset show that the method proposed in this paper can detect different action behaviors of students in the classroom and has better detection performance compared to the original AIA network. Finally, based on the results of action behavior recognition, the seat number is used as the index to obtain the action behavior sequence of individual students during the teaching time of knowledge points and the performance of students in this period is analyzed.


Introduction
Classroom teaching activities have always been the focus of research in the field of education, and students are the main body of the activities.Recognizing and analyzing students' action behavior plays an important role in teaching evaluation [1].Student action behavior analysis can help teachers understand the learning process of students and is an important means of measuring learning effectiveness.Traditional methods of analyzing student action behaviors rely on manual assessment, manual recording, and manual coding to collect and interpret behavioral performance data, which have the disadvantages of strong coding subjectivity, small sample size, and high time-consumption [2].With the development of education informatization, smart classrooms are widely used; so, cameras and other devices in the classroom can be used to obtain massive data such as videos and images in the teaching process.For the acquired massive data, selecting the appropriate sampling scheme and decomposition algorithm can effectively compress the data while retaining the integrity of the data features [3].Moreover, extracting effective features from the data based on machine learning methods can be used to intelligently recognize students' action behaviors.In recent years, some researchers have used traditional algorithms such as wavelet transform and discrete path transform to extract features from data [4,5], Although the extracted features are more representative and generalized, the traditional methods still have problems such as manual intervention and poor robustness.In addition, some researchers have also applied fractal theory to the field of image processing [6].
Different from traditional machine learning methods, deep learning automatically learns feature representations of data such as videos and images, which is robust and can be used as an effective tool for detecting students' action behaviors [7].Tang et al. [8] proposed a student posture detection method based on an improved Faster RCNN object detection model to recognize the behaviors of standing, sitting, and sleeping.Liu et al. [9] proposed an action detection framework based on YOLOv5, which is used to detect the classroom behavior of students in the monitoring system under different backgrounds.Although the above studies achieved good results in student action behavior detection, they did not consider the temporal features of the behavior and the interaction with the spatial contextual environment, and there are limitations in understanding the action behavior based on a single-frame image.
Different from student action behavior recognition based on static images, video behavior recognition can effectively utilize time-series information.Xie et al. [10] proposed a college students' classroom behavior recognition algorithm based on spatiotemporal representation learning, which is used to recognize students' abnormal behaviors such as sleeping and playing mobile phones.However, this type of research is only applicable to action behavior recognition in single-person scenarios, which cannot mark the spatial location of each student and cannot be applied to action behavior detection in multi-person classroom scenarios.
The purpose of spatiotemporal behavior detection is to locate each person in the video and recognize their action behaviors.Many existing studies have improved the performance of action behavior detection by modeling the interactions between the target person and the contextual environment.Students' action behaviors usually interact with other students, objects, etc., around them, and modeling these interactions can be used to recognize the students' action behaviors more effectively.In addition, the background of the classroom scene is complex, the students are crowded, the pixel area occupied by the front row and the back row students in the video is different, and the visual speed of each student's action behavior is also different; so, the extraction of video features needs to be optimized for these problems.Based on the above analysis, this paper proposes a student action behavior recognition method based on classroom time-series image data.Firstly, the improved Asynchronous Interaction Aggregation (AIA) network [11] is used to detect the action behavior of students in classroom videos.Then, based on the recognition results, the action behavior sequence of individual students during the teaching time of knowledge points is obtained and analyzed.In addition, to validate the effectiveness of the proposed action behavior detection algorithm, a student action behavior dataset is constructed and Equalized Focal Loss (EFL) [12] is introduced to improve the category imbalance problem existing in the dataset.The main contributions of this paper are as follows: a.
We propose an action behavior detection model based on an improved AIA network [11].The Multi-scale Temporal Attention (MsTA) module and Multi-scale Channel Spatial Attention (MsCSA) module are added to the video backbone network SlowFast, which improves the accuracy of students' action recognition.b.
The EFL function [12] was introduced to dynamically adjust the categorization loss weights of different categories to improve the category imbalance problem existing in the dataset.c.
Experiments are conducted on a self-made student action behavior dataset.The experimental results show that the algorithm proposed in this paper improves the mean average precision (mAP) value of action behavior detection.Based on the results of action behavior recognition, the student's seat number is used as the index to analyze the sequence of students' action behavior during the teaching time of knowledge points.

Video Behavior Recognition
The research on video behavior recognition is broadly divided into behavior recognition, temporal behavior detection, and spatiotemporal behavior detection.Behavior recognition is to classify the input video; temporal behavior detection needs to detect the start time and end time of the behavior and recognize the behavior of the human in that time; spatiotemporal behavior detection needs to locate the spatial position of the human in the video and recognize the duration and category of the behavior.In this paper, multiple students in classroom teaching videos are used as research objects to detect their action behaviors; therefore, the spatiotemporal behavior detection method is used to detect students' action behaviors in this paper.
Video is different from images, detecting human behavior requires extracting both appearance features and motion features of the frame sequence.Based on these features, researchers have proposed many effective networks to detect human behavior in video.Feichtenhofer et al. [13] proposed a dual pathway network SlowFast, where the Slow pathway and Fast pathway are responsible for the extraction of the appearance features and motion features, respectively.Literature [14] proposed an efficient X3D network that reduces the amount of computation and, at the same time, gives better results in video behavior recognition.Such studies are based on a two-stage approach to recognize human behavior: one stage is used to generate the human bounding box and the other stage is used for behavior recognition.Recently, a researcher proposed a one-stage network to detect human behavior in video [15], which decouples detection and behavior recognition into two branches: one branch is responsible for detecting humans in video and the other is responsible for recognizing the behavior.
With the application of transformer in the image processing field, Fan et al. [16] proposed a Multi-scale Transformer, which combines the multiscale feature hierarchies with the transformer to extract the video features at different levels so that the model can better understand the video content.In addition, behavior recognition methods based on interaction relationship modeling are also widely used.Tang et al. [11] proposed an AIA Network that models human-human interaction, human-object interaction, and temporal interaction.Pan et al. [17] proposed the Actor-Context-Actor Relation (ACAR) Network, which explicitly models higher-order relationships between humans based on their interactions with the background information.Zheng et al. [18] proposed a Multi-Relational Support Network (MRSN), which first models actor-context and actor-actor relationships separately and then models the two types of interactions at the relational level.Faure et al. [19] proposed a multi-modal Holistic Interaction Transformation (HIT) Network, which contains two branches: the RGB branch and the pose branch.The two branches extract appearance features and motion features, respectively; finally, the features extracted from the two branches are fused and input to the classification layer to detect human behavior.

Behavior Recognition in Classroom Scenarios
In recent years, there have been many scholars applying machine learning to students' action behavior recognition, which mainly focuses on recognizing students' action behaviors through data modalities such as human keypoints, still images, and videos.Therefore, this paper will introduce related research from these aspects.
In the study based on human keypoints, Zhang et al. [20] proposed a method to recognize students' postures in the classroom, which used an improved HRNet network to extract students' keypoints and then used a support vector machine to classify students' classroom behaviors.Zhou et al. [21] extracted the key information of human skeleton from student behavioral images and recognized students' classroom behaviors based on a deep convolutional neural network (CNN-10).Pang et al. [22] improved the traditional algorithm by combining the traditional cluster analysis algorithm and random forest algorithm with the human skeleton model to recognize students' classroom behaviors in real-time.Ding et al. [23] used OpenPose to extract human keypoints and used a graph convolutional neural network to classify the features and recognize the abnormal behaviors of students.
In their study based on static images, Wu et al. [24] proposed a motion object detection algorithm for student behavior recognition in the classroom, recognizing students' standing behaviors based on the region of interest (ROI) and face tracking, and students' hand-raising behaviors based on skin color detection.Banerjee et al. [25] proposed an improved SSD object detection model to recognize the behavior of students and teachers in the teaching laboratory.Liu et al. [26] improved the two-stage object detection network and proposed a new ROI extractor SAA module and a new detection head RST module for student behavior detection in classroom scenes.Huang et al. [27] constructed a deep neural network to extract facial keypoints, recognize students' head postures and expressions, and classify classroom behaviors by combining the head gestures and expressions.Zheng et al. [28] improved the CBL module of the YOLOv5 model to detect students' classroom behavior from multiple perspectives to evaluate students' classroom attention.
In the study based on videos, Liu et al. [29] proposed a 3D multi-scale residual dense network based on heterogeneous view perception for recognizing students' classroom behaviors.Jisi et al. [30] combined a spatial affine transform network with a convolutional neural network to extract the spatiotemporal features of the video and fused the spatiotemporal features to classify students' behaviors by using a weighted sum method.
Although the above studies have achieved good results in the field of student action behavior detection, there are still some shortcomings: (1) human keypoint-based studies have difficulties in extracting human keypoints in student-intensive classroom scenarios and it is difficult to accurately extract the keypoints of each student; (2) static-image-based studies ignore the temporal features of the action behaviors and do not incorporate timeseries context information; (3) video-based studies incorporate the time-series context information of action behaviors but ignore the interaction between students and the context of the environment; these methods only recognize individual action behaviors and do not apply to real multi-person classroom scenarios.
To address the above issues, this paper uses a spatiotemporal action detection model based on interaction relationship modeling to locate students in classroom videos while recognizing their action behaviors and optimizes the model for the difficulties existing in the classroom scenarios; based on the results of action behavior recognition, the student's seat number is used as the index to obtain the action behavior sequence of individual students during the teaching time of knowledge points to help teachers understand the students' learning process and take personalized intervention measures to improve the teaching effect.

Proposed Method
The overall system structure diagram of the student action behavior recognition model proposed in this paper is shown in Figure 1.Firstly, the spatiotemporal features of classroom video are extracted based on the backbone network in the improved AIA network.Secondly, students and objects such as books and mobile phones are detected by using the object detection model applicable to the classroom scenario; then, the spatiotemporal features of the video and the spatial coordinate information of the students are fused to detect the student's action behaviors.Finally, based on the action behavior recognition results during the teaching time of the knowledge points, the student's seat number is used as the index to correlate the student's action behavior sequence during the period.

Detection of Students' Action Behavior
The AIA network [11] has achieved good results in spatiotemporal action behavior detection; however, the network uses a fixed convolutional kernel to learn video features, which cannot capture the spatiotemporal information of multi-scale feature maps to enrich the feature space, and the detection accuracy of the small sample categories is lower in a dataset with category imbalance.Therefore, in a real classroom environment, the AIA network does not have high accuracy in detecting action behaviors with small sample sizes and insignificant features.To improve the detection accuracy, this paper proposes an improved AIA network: firstly, it incorporates the MsCSA module into the slow pathway of the backbone network SlowFast [13], and incorporates the MsTA module into the fast pathway of SlowFast, which helps the network to extract the multiscale spatiotemporal features; second, the EFL function [12] is introduced to improve the detection performance of action-behavior categories with small sample sizes.The improved AIA network structure is shown in Figure 2 and consists of three parts: The first is part a, which uses an independent object detection network to detect students and objects in video keyframes; next is part b where, in this paper, the MsCSA module is added to the Res2, Res3, and Res4 stages of the slow pathway of the SlowFast network [13] and the MsTA module is added to the Res3, Res4, and Res5 stages of the fast pathway of the SlowFast network to help the network extract multi-scale spatial features and multi-scale temporal features.Finally, the AIA module in part c consists of the Asynchronous Memory Update (AMU) Algorithm and the Interaction Aggregation (IA) structure.The ROI Align algorithm [31] is used to extract the feature P t of the student bounding box and the feature O t of the object bounding box; store and read the memory feature M t online using the AMU algorithm; input P t , O t , and M t into the IAstructure to model multiple types of interaction relationships; and then classify the fused features to detect the student's action behavior.

Detection of Students' Action Behavior
The AIA network [11] has achieved good results in spatiotemporal action behavior detection; however, the network uses a fixed convolutional kernel to learn video features, which cannot capture the spatiotemporal information of multi-scale feature maps to enrich the feature space, and the detection accuracy of the small sample categories is lower in a dataset with category imbalance.Therefore, in a real classroom environment, the AIA network does not have high accuracy in detecting action behaviors with small sample sizes and insignificant features.To improve the detection accuracy, this paper proposes an improved AIA network: firstly, it incorporates the MsCSA module into the slow pathway of the backbone network SlowFast [13], and incorporates the MsTA module into the fast pathway of SlowFast, which helps the network to extract the multiscale spatiotemporal features; second, the EFL function [12] is introduced to improve the detection performance of action-behavior categories with small sample sizes.The improved AIA network structure is shown in Figure 2 and consists of three parts: The first is part a, which uses an independent object detection network to detect students and objects in video keyframes; next is part b where, in this paper, the MsCSA module is added to the Res2, Res3, and Res4 stages of the slow pathway of the SlowFast network [13]

Video Backbone Network
The video backbone network SlowFast is the b part of Figure 2. Literature [13] categorizes SlowFast into slow and fast pathways based on the difference in the sampling step of the frames, where the number of convolutional blocks in the residual layer is the same and only the number of convolutional kernels and output channels are different.The sampling step for slow pathway frames is τ = 16, 1 frame every 16 frames, and the number of output channels is C to learn spatial semantic information with fewer frames and a larger number of channels; the sampling step for fast pathway frames is τ/α(α = 8), 1 frame every 2 frames, and the number of output channels is βC β = 1 8 to learn the motion information with a larger number of frames and fewer number of channels.Lateral connections are established with the slow pathway in the Res2, Res3, and Res4 stages of the fast pathway, and the features of the fast pathway are fused with the features of the slow pathway after reconstruction.
Different from the traditional SlowFast network, this paper adds an MsCSA module after the Res2, Res3, and Res4 stages of the slow pathway and an MsTA module after the Res3, Res4, and Res5 stages of the fast pathway to extract multi-scale spatial features and multi-scale temporal features of the classroom teaching video and, thus, to improve the detection of the students' action behavior accuracy.In this paper, these two modules will be introduced in detail in Sections 3.1.4and 3.1.5.

Feature Extraction
The video clip v t is taken as input and the video features f t ∈ R C×T×H×W are extracted using the backbone network described in Section 3.1.1,where C, T, H, and W are channel, time, height, and width, respectively.The video features f t are average pooled along the time dimension to obtain the feature map I t ∈ R C×H×W .Meanwhile, the Faster RCNN [32] is used to locate students in the keyframes (i.e., the center frames) of the video clip v t , as well as objects, such as books and mobile phones, and obtains N t student bounding boxes and K t object bounding boxes.Based on the detected bounding boxes, the ROI Align algorithm [31] is used to extract the student features of the video clip v t along the spatial dimension from the feature map I t .In addition, to model the long-term temporal context between different clips, the student features of multiple clips are deposited into the feature pool using the AMU algorithm and the student features of the neighboring clip are read from the feature pool each time of training; then, they are combined with those of the current clip to form a memory feature where L is the size of the time window.Thus, the memory feature M t contains long-term semantics, which can provide useful temporal semantics to help recognize temporally relevant action behaviors such as reading and writing.

Modeling and Aggregation of Interactions
In addition to spatial and temporal features, student self-interactions, student-student interactions, student interactions with objects such as books and mobile phones, and longterm temporal interactions of the same student are crucial for understanding the student's action behaviors.Given different student features P t , object features O t , and memory features M t , the IA structure can fuse these features to model and aggregate the above interaction types for more accurate action behavior detection.
The IA structure consists of multiple interaction blocks improved based on transformer blocks [33], each modeling a single type of interaction through an attention mechanism.Interaction blocks are divided into three types in total: P-Block, O-Block, and M-Block.P-Block is used to model human-human interaction (including self-interaction) in the same clip, and its query and key/value are both student features or enhanced student features; O-Block is used to model human-object interaction, and its key/value input is the object features O t .Figure 3a shows an illustration of an O-Block; M-Block is used to model the same human at different frames of time interaction, and its key/value input is the memory features M t .
The IA structure consists of multiple interaction blocks improved based on transformer blocks [33], each modeling a single type of interaction through an attention mechanism.Interaction blocks are divided into three types in total: P-Block, O-Block, and M-Block.P-Block is used to model human-human interaction (including self-interaction) in the same clip, and its query and key/value are both student features or enhanced student features; O-Block is used to model human-object interaction, and its key/value input is the object features t O . Figure 3a shows an illustration of an O-Block; M-Block is used to model the same human at different frames of time interaction, and its key/value input is the memory features In a Dense Serial IA structure, each interaction block accepts the output of all previous interaction blocks, and different types of interactions interact with each other and are aggregated using learnable weights.The query for the i th block can be represented as In a Dense Serial IA structure, each interaction block accepts the output of all previous interaction blocks, and different types of interactions interact with each other and are aggregated using learnable weights.The query for the ith block can be represented as where denotes the element-wise multiplication, C is the set of indices of previous blocks, W j is a learnable d-dimensional vector normalized with a Softmax function among C, and E t,j is the enhanced output features from the jth block.Dense Serial IA is illustrated in Figure 3b.The fused features from the Dense Serial IA structure are fed into the classifier to detect the student's action behavior.

MsTA Module
In the field of video action recognition, existing research uses fixed convolution kernels or operations to learn the temporal features of videos.The AIA network used in this paper also uses fixed 3 × 1 × 1 convolution to learn the temporal features of students' action behaviors in the classroom scene.However, different action behaviors last for different durations; so, it is crucial to capture the multi-scale temporal features.This paper proposes a Multi-scale Temporal Convolution Unit (MsTCU) with different temporal convolution kernels, which utilizes different temporal sliding windows to extract the multi-scale temporal information.
The Double Attention (DA) block of A 2 -Net [34] is applied to the fast pathway of the backbone network of the action detection model in this paper and cascaded with the abovementioned MsTCU to form an MsTA module, which extracts the local temporal features and global temporal features to model the long-distance interdependencies.The MsTCU and the DA block are described below.
MsTCU.As shown in Figure 4, the feature map X ∈ R C×T×H×W first passes through the three branches of MsTCU, which are equipped with different temporal convolution kernels, respectively, and perform temporal convolution processing on the feature map X to obtain multi-scale temporal features as follows: tures and global temporal features to model the long-distance interdependencies.The MsTCU and the DA block are described below.
MsTCU.As shown in Figure 4, the feature map first passes through the three branches of MsTCU, which are equipped with different temporal convolution kernels, respectively, and perform temporal convolution processing on the feature map X to obtain multi-scale temporal features as follows: ,, X X X    .Afterward, the feature maps of the three branches are summed up element by element: (2) To maintain spatiotemporal consistency, F extracts the spatial features of the video through three dilated convolutional branches with convolutional kernel size 3 and dilation rates 1, 2, and 3, respectively.The output feature maps are represented using ,, X X X    .Afterward, the feature maps of the three branches are concatenated to- gether and the channel is reduced in dimension by 111  convolution: (1) X ∈ R C×T×H×W passes through three branches with temporal convolution kernel sizes of 3, 5, and 7, respectively, and the dimensionality of the output feature maps is constant in size, denoted as X 0 , X 1 , X 2 Afterward, the feature maps of the three branches are summed up element by element: (2) To maintain spatiotemporal consistency, F 1 extracts the spatial features of the video through three dilated convolutional branches with convolutional kernel size 3 and dilation rates 1, 2, and 3, respectively.The output feature maps are represented using X 0 , X 1 , X 2 .Afterward, the feature maps of the three branches are concatenated together and the channel is reduced in dimension by 1 × 1 × 1 convolution: where f 1×1×1 denotes the 1 × 1 × 1 convolution, F 2 ∈ R C×T×H×W .DA block.As shown in Figure 5, let the input be X ∈ R C×T×H×W and the local feature of each spatiotemporal location i = 1, • • • , THW be v i .Then, define Equation (4):  ( ) where A and B can be the feature maps from the same layer, i.e., AB = , or from two different layers, i.e., There are two operations in the equation.First, the G gather operation adaptively aggregates the features of the global context; then, the F distr operation assigns the aggregated global context features back to each location i based on the local features v i and outputs z i .z i can be obtained by steps (1) and ( 2): (1) Feature Gathering.The DA block uses bilinear pooling to capture the second-order statistics of features to generate a global representation.Bilinear pooling is the summation of the outer product of all feature vectors (a i , b i ) in the two input feature maps A and B: In CNNs, A and B can be the feature maps from the same layer, i.e., A = B, or from two different layers, i.e., A = φ X; W φ and B = θ(X; W θ ), with parameters W φ and W θ .
By introducing the output variable Equation (5) shows that G can be viewed as a collection of visual elements in a sequence of frames, where each subset g i is obtained by gathering local features weighted by b i ; then, further applying a softmax onto B to ensure ∑ j b ij = 1, i.e., a valid attention weighting vector, gives the following second-order attention pooling process: (2) Feature Distribution.After the global features are collected from the frame sequence, the input X is first transformed by a convolutional layer to obtain the feature map V.The elements within V are normalized using the softmax function, i.e., V = so f tmax ρ X; W ρ , and Wρ are parameters of the convolutional layer; then, the attention vectors v i at each position of the feature map V are multiplied by the global features G and a subset of the global feature vectors can be adaptively assigned according to the weight magnitude of each vector v i , as shown in Equation (8): (3) Substituting Equations ( 7) and (8) into Equation ( 4), the double attention operation can be expressed by Equation ( 9): Based on the above process of double attention operation, the computational diagram of the DA block is briefly summarized as follows: the input feature map X ∈ R C×T×H×W is passed through three different convolutional layers to obtain the feature maps A, B, and V. B and V need to be softmax normalized.A and B are subjected to bilinear pooling operation and then multiplied by Finally, residual connections are added to the MsTCU and DA block, respectively, to ensure the propagation of information across layers, as shown in Figure 6: where X is the input feature map and Y is the output feature map of MsTCU (DA block).
convolution is used to reshape . Finally, residual connections are added to the MsTCU and DA block, respectively, to ensure the propagation of information across layers, as shown in Figure 6: where X is the input feature map and Y is the output feature map of MsTCU (DA block).The video data captured in this study have a large resolution; a large difference in pixel size occupied by front and back row students in space; and the need to extract small object features t O , such as books and mobile phones around the students, along with other types of features, to model the interaction.Literature [35] shows that multi-scale receptive field is helpful for the network to notice the spatial position of different size objects.. Therefore, a Multi-scale Spatial Attention (MsSA) block is added to the slow pathway of the video backbone network to obtain contextual information of a larger receptive field; at the same time, to allow the network to learn the weight response of the feature maps of different channels in the process, the MsSA block is cascaded with Gaussian Context Transformer (GCT) [36] to form an MsCSA module.The MsCSA module is shown in Figure 7.

MsCSA Module
The video data captured in this study have a large resolution; a large difference in pixel size occupied by front and back row students in space; and the need to extract small object features O t , such as books and mobile phones around the students, along with other types of features, to model the interaction.Literature [35] shows that multi-scale receptive field is helpful for the network to notice the spatial position of different size objects.Therefore, a Multi-scale Spatial Attention (MsSA) block is added to the slow pathway of the video backbone network to obtain contextual information of a larger receptive field; at the same time, to allow the network to learn the weight response of the feature maps of different channels in the process, the MsSA block is cascaded with Gaussian Context Transformer (GCT) [36] to form an MsCSA module.The MsCSA module is shown in Figure 7. GCT block.The GCT block first performs the global average pooling operation on the input feature map X in the spatial dimension; then, it stabilizes the distribution of the global features by normalizing the channel vectors; finally, it obtains the attention map by using the Gaussian function to perform the excitation operation on the normalized global features.The specific process is as follows: (1) Global Context Aggregation (GCA).Let the input R  ; the GCA can be expressed as where CT is the number of channels, and H and W are the height and width of the feature map, respectively.
(2) Normalize.The GCT block computes the attentional activation value of the channel using the function ( ) f  .Define ẑ as the input to the function ( ) f  , and ẑ is rep- resented by Equation ( 12): ( ) where denotes the mean of the global context z , and z  − is the GCT block.The GCT block first performs the global average pooling operation on the input feature map X in the spatial dimension; then, it stabilizes the distribution of the global features by normalizing the channel vectors; finally, it obtains the attention map by using the Gaussian function to perform the excitation operation on the normalized global features.The specific process is as follows: (1) Global Context Aggregation (GCA).Let the input X ∈ R C×T×H×W reshape X into R CT×H×W ; the GCA can be expressed as where CT is the number of channels, and H and W are the height and width of the feature map, respectively.
(2) Normalize.The GCT block computes the attentional activation value of the channel using the function f (•).Define ẑ as the input to the function f (•), and ẑ is represented by Equation ( 12): where µ = 1 C ∑ C k=1 z k denotes the mean of the global context z, and z − µ is the mean shift. 2 + ε denotes the standard deviation of the global context z, ε is a very small constant, and σ is introduced so that ẑ remains in distribution with mean 0 and variance 1 for different input samples, thus ensuring that the output of f (•) is stable.Equation ( 11) is consistent with the result of normalizing z and, thus, can be expressed as ẑ = norm(z).
(3) Gaussian Context Excitation (GCE).The GCT block substitutes the Gaussian function into f ( ẑ) and defines the GCE operation as where c is a constant or learnable parameter and g is the attentional activation value.
The above steps are combined to form the GCT block, which is represented by Equation ( 14): where F c ∈ R CT×H×W is the feature map output by the GCT block.MsSA block.After the GCT block, F c ∈ R CT×H×W is input to the Multi-scale Spatial Convolution Unit (MsSCU) in the MsSA block to extract the spatial information at different scales.The process is shown below: (1) F c undergoes dilation convolution with a convolution kernel size of 3 × 3 and dilation rates of 1, 2, 3, and 4, respectively; then, F c is sliced into four parts, denoted using [X 1, X 2 , X 3 , X 4 ], with the number of channels in each part being C = CT r , and then concatenated in the channel dimension (2) r ×H×W undergoes a 3 × 3 convolution to further fuse the multi-scale features and recover the original number of channels CT to obtain the feature map F 2 ∈ R CT×H×W : where f 3×3 denotes the 3 × 3 convolution operation in the spatial dimension; for ease of representation, the above process is denoted as F 2 = MsSCU(F c ).After two MsSCU, the spatial attention feature map is then obtained by the sigmoid function where F s ∈ R CT×H×W , σ is the sigmoid function, and F c and F s are multiplied element by element to obtain the feature map Y : Reshape Y to Y ∈ R C×T×H×W to keep the size of the input constant.Similar to the MsTA module, residual connections are added to the MsCSA module to ensure the propagation of information across layers.

Equalized Focal Loss Function
The student action behavior dataset used in this paper suffers from a category imbalance problem, whereby most of the students show action behaviors such as listening and reading during the class period while a few students have action behaviors such as using mobile phones and lying on the table.The loss function used in the original AIA network is Focal Loss [37], but Li et al. [12] proved that Focal Loss does not deal with the foreground category imbalance problem; so, in this paper, we use the EFL proposed by Li et al. [12] as the loss function of the action classifier in the AIA network to improve the category imbalance problem, specifically shown by Equation ( 19): where α t is used to regulate the weights of positive and negative samples during training and p t is the prediction score.
is the weight factor associated with the category, is the focusing factor for the jth category, γ b is a constant, and γ j v can be expressed as where the hyper-parameter s is the scaling factor that determines the upper limit of γ j v in EFL, and the parameter g j indicates the accumulated gradient ratio of positive samples to negative samples of the jth category during the training process, with the range of g j set to [0,1].
Since g j indicates the accumulated gradient ratio of positive samples and negative samples of the jth category during the training process, a larger value of g j indicates that the category is trained to be balanced and a smaller value of g j indicates that the category is trained to be unbalanced; so, the focusing factor γ b + γ j v and the prediction score p t in the EFL, composed of (1 − p t ) γ b +γ j v as the category-related weighting factor in the loss, can dynamically regulate the weight of the loss based on the cumulative gradient ratio of positive and negative samples in each category during the training process and the prediction score p t to handle the problem of category imbalance.

Seat-Association-Based Analysis of Students' Action Behavior Sequence
The application scenario of this paper is in the classroom.The classroom scenario is densely populated with students, mutual occlusion, and the pixel area occupied by the face region in the video frame is small; thus, it is difficult for the face recognition technology to associate the student object across the video frames.During the class period, the position of students is fixed and the position shift of individual students in the video frame sequence is small; therefore, this paper associates the same student object through the student's seat across the video frames, tracks the changes of the same student's action behaviors in a continuous period, and obtains the action behavior sequence of individual students in the teaching time of knowledge points: (1) Students' use of electronic devices in the classroom cannot be simply classified as playing with a mobile phone; so, this paper first obtains the recognition result set G A of the action behavior detection model within the teaching time of knowledge point k.The number of students who use mobile phones in the set G A is counted; if the ratio of the number of students is greater than 0.6, it is considered that the teacher published an in-class test and classifies the behavior as reading.
(2) In this paper, we first use the LabelImg tool to label a rectangular box for each student's seat in the classroom and save its coordinate information; then, we extract the student's bounding box information from G A .We use the method of calculating the Complete Intersection over Union (CIoU) in the literature [38] to calculate the CIoU value of the seat rectangular box and the student's bounding box to measure the overlap between the two boxes, match the student's seat based on the maximum overlap, and then obtain the sequence of individual student's action behaviors by using the seat number as the index.

Datasets
Dataset 1: The AVA dataset [39] is a dataset oriented to the spatiotemporal action detection task.This dataset takes 1 frame per second as a keyframe to be labeled.There are 80 atomic action classes, including three major classes: posture actions, actions of human-human interaction, and actions of human-object interaction.
Dataset 2: UCF101-24 is a dataset oriented to the spatiotemporal action detection task.This dataset is labeled frame-by-frame and contains a total of 24 classes of actions.
Dataset 3: At present, there is no public student action behavior dataset; so, this paper constructs a dataset of students' action behaviors based on real classroom videos and annotated concerning the AVA dataset [39], which contains seven types of action behaviors, including listening, looking around, lying on the table, reading, writing, using mobile phone, and talking, as shown in Figure 8.The construction process of the dataset is as follows:

Datasets
Dataset 1: The AVA dataset [39] is a dataset oriented to the spatiotemporal action detection task.This dataset takes 1 frame per second as a keyframe to be labeled.There are 80 atomic action classes, including three major classes: posture actions, actions of human-human interaction, and actions of human-object interaction.
Dataset 2: UCF101-24 is a dataset oriented to the spatiotemporal action detection task.This dataset is labeled frame-by-frame and contains a total of 24 classes of actions.
Dataset 3: At present, there is no public student action behavior dataset; so, this paper constructs a dataset of students' action behaviors based on real classroom videos and annotated concerning the AVA dataset [39], which contains seven types of action behaviors, including listening, looking around, lying on the table, reading, writing, using mobile phone, and talking, as shown in Figure 8.The construction process of the dataset is as follows: Step 1: Collect real classroom videos from a university; screen and edit the videos; and select a total of 25 videos, each with a length of 5 min.
Step 2: Firstly, video frames are extracted according to the frame rate of 30 frames per second; then, every 30 frames are extracted as keyframes, which are used to label the students' positions and action behaviors.
Step 3: Use the VGG Image Annotation tool to label the position and action behavior categories of students in the keyframes, save the labeling information in CSV file format, and then process it into labeling files in AVA format.In addition, to model the interaction between students and the environment in the classroom video, it is necessary to correctly find the objects that students interact with.Therefore, this paper trains a target detection model based on YOLOv5 that is suitable for classroom scenes and then uses the model to detect objects such as cell phones and books in key frames, extracts coordinate information and category indexes of cell phones and books based on the detection results, and generates labeled files in the format of the COCO dataset to be used as auxiliary training data.

Experimental Results on the Public Dataset
In the experiments on the AVA dataset, the model was trained using the pretraining parameters of the Kinetics-700 dataset for weight initialization.The input to the network is sixty-four frames, 8  = , 16  = , i.e., thirty-two frames for the fast pathway and four frames for the slow pathway.To reduce the number of parameters, the short edges of the video frames are cropped to 256, two GPUs are used for training, and the SGD optimization algorithm is used with a batch size of four and an initial learning rate of 0.00025, which is adjusted in the first 2k iterations using linear warm-up.In this paper, we refer to Step 1: Collect real classroom videos from a university; screen and edit the videos; and select a total of 25 videos, each with a length of 5 min.
Step 2: Firstly, video frames are extracted according to the frame rate of 30 frames per second; then, every 30 frames are extracted as keyframes, which are used to label the students' positions and action behaviors.
Step 3: Use the VGG Image Annotation tool to label the position and action behavior categories of students in the keyframes, save the labeling information in CSV file format, and then process it into labeling files in AVA format.
In addition, to model the interaction between students and the environment in the classroom video, it is necessary to correctly find the objects that students interact with.Therefore, this paper trains a target detection model based on YOLOv5 that is suitable for classroom scenes and then uses the model to detect objects such as cell phones and books in key frames, extracts coordinate information and category indexes of cell phones and books based on the detection results, and generates labeled files in the format of the COCO dataset to be used as auxiliary training data.

Experimental Results on the Public Dataset
In the experiments on the AVA dataset, the model was trained using the pretraining parameters of the Kinetics-700 dataset for weight initialization.The input to the network is sixty-four frames, α = 8,τ = 16, i.e., thirty-two frames for the fast pathway and four frames for the slow pathway.To reduce the number of parameters, the short edges of the video frames are cropped to 256, two GPUs are used for training, and the SGD optimization algorithm is used with a batch size of four and an initial learning rate of 0.00025, which is adjusted in the first 2k iterations using linear warm-up.In this paper, we refer to the and loss values of the proposed method during the experimental process.The results show that the improved AIA network in this paper converges stably during the training process and achieves 92% accuracy., the batch size is set to four, the initial learning rate is 0.00025, and the input of the network is 64 frames.Figure 9 shows the curves of the accuracy and loss values of the proposed method during the experimental process.The results show that the improved AIA network in this paper converges stably during the training process and achieves 92% accuracy.To verify the effectiveness of the different modules added to the improved AIA network and the improved loss function, Table 3 lists the experimental results with the addition of the MsTA module and the MsCSA module, as well as the comparative results of replacing the loss function with the EFL function [12] and the EQLv2 function [42].As can be seen from Table 3, after integrating the MsTA module and the MsCSA module into the AIA network, the mAP values of student action detection improved.Students' different action behaviors during class have different durations and visual speeds, and the MsTA module can extract action behavior features with different time scales.For example, when students are listening carefully, they usually look up and face the teacher, and there is not much posture change between the current frame and the history frame; however, when students are looking around, the module needs to combine the temporal features of multi-frames to recognize the student's "looking around" behavior.When students are reading, writing, or using mobile phones, they will turn the book, take a pen, or touch the book or mobile phone and the MsCSA module expands the receptive field in the space and enhances the response of the mobile phone, book, and other objects in the feature map, thus enhancing the interaction features of the student and the object so that To verify the effectiveness of the different modules added to the improved AIA network and the improved loss function, Table 3 lists the experimental results with the addition of the MsTA module and the MsCSA module, as well as the comparative results of replacing the loss function with the EFL function [12] and the EQLv2 function [42].As can be seen from Table 3, after integrating the MsTA module and the MsCSA module into the AIA network, the mAP values of student action detection improved.Students' different action behaviors during class have different durations and visual speeds, and the MsTA module can extract action behavior features with different time scales.For example, when students are listening carefully, they usually look up and face the teacher, and there is not much posture change between the current frame and the history frame; however, when students are looking around, the module needs to combine the temporal features of multi-frames to recognize the student's "looking around" behavior.When students are reading, writing, or using mobile phones, they will turn the book, take a pen, or touch the book or mobile phone and the MsCSA module expands the receptive field in the space and enhances the response of the mobile phone, book, and other objects in the feature map, thus enhancing the interaction features of the student and the object so that the network better recognizes the interaction behaviors such as reading, writing, or using mobile phones.

Student Action
In addition, due to the problem of category imbalance in the dataset, there are fewer samples of behaviors such as lying on the table, using a mobile phone, and talking, and more samples of behaviors such as listening and reading.Both the EQLv2 function and the EFL function are used to improve the problem by adjusting the loss weights of the different categories.The experimental results in Table 3 show that the EFL function performs better in the study of this paper; so, the EFL is chosen as the improved AIA network's loss function.
Figure 10 shows the AP values of seven student action behaviors for the method proposed in this paper and the AIA network.The original AIA network has a good recognition effect for action behaviors with a large number of samples, such as listening, reading, and writing; however, the recognition accuracy of action behaviors with a small number of samples, such as lying on the table, using mobile phones, and talking, is low.The improved AIA network significantly improves the detection accuracy of the four categories of action behaviors such as looking around, lying on the table, using mobile phones, and talking, indicating that the method proposed in this paper can not only extract spatiotemporal features that are beneficial to the recognition of action behaviors but also deal with the problem of imbalance of the foreground category that exists in dataset 2, which has a greater improvement in the recognition accuracy of the categories of action behaviors with fewer samples.Although there are few sample instances of lying on the table, the features are obvious, and the detection accuracy reaches 80.7%.The detection accuracy of "talk" behavior is improved by 5.7% compared with that of the AIA network but is still significantly lower than the other action behaviors, mainly because the application scenario of this paper is densely populated with students and the head region of the students occupies a smaller area in the image, with less obvious features.
In addition, due to the problem of category imbalance in the dataset, there are fewer samples of behaviors such as lying on the table, using a mobile phone, and talking, and more samples of behaviors such as listening and reading.Both the EQLv2 function and the EFL function are used to improve the problem by adjusting the loss weights of the different categories.The experimental results in Table 3 show that the EFL function performs better in the study of this paper; so, the EFL is chosen as the improved AIA network's loss function.
Figure 10 shows the AP values of seven student action behaviors for the method proposed in this paper and the AIA network.The original AIA network has a good recognition effect for action behaviors with a large number of samples, such as listening, reading, and writing; however, the recognition accuracy of action behaviors with a small number of samples, such as lying on the table, using mobile phones, and talking, is low.The improved AIA network significantly improves the detection accuracy of the four categories of action behaviors such as looking around, lying on the table, using mobile phones, and talking, indicating that the method proposed in this paper can not only extract spatiotemporal features that are beneficial to the recognition of action behaviors but also deal with the problem of imbalance of the foreground category that exists in dataset 2, which has a greater improvement in the recognition accuracy of the categories of action behaviors with fewer samples.Although there are few sample instances of lying on the table, the features are obvious, and the detection accuracy reaches 80.7%.The detection accuracy of "talk" behavior is improved by 5.7% compared with that of the AIA network but is still significantly lower than the other action behaviors, mainly because the application scenario of this paper is densely populated with students and the head region of the students occupies a smaller area in the image, with less obvious features.In Figure 11, the improved AIA network model can detect students' action behaviors such as listening, reading, and using mobile phones during class.Since the number of students in the classroom is large and the boxes are dense, Figure 12 shows an example of the detection of action behaviors in a local area.

Analysis of Students' Action Behavior Sequences
The action behavior of the students in the classroom is detected by the action behavior recognition method described in Section 3.1.Then, each student's seat is matched using the method described in Section 3.With the seat number as the index, the changes in students' action behavior are tracked, and the sequence of students' action behavior in the teaching time of knowledge points k is obtained, as shown in Table 5. Listening, writing, reading, looking around, using mobile phones, lying on the table, and talking are represented by 3, 2, 1, −1, −2, −3, and −4, respectively.

Analysis of Students' Action Behavior Sequences
The action behavior of the students in the classroom is detected by the action behavior recognition method described in Section 3.1.Then, each student's seat is matched using the method described in Section 3.With the seat number as the index, the changes in students' action behavior are tracked, and the sequence of students' action behavior in the teaching time of knowledge points k is obtained, as shown in Table 5. Listening, writing, reading, looking around, using mobile phones, lying on the table, and talking are represented by 3, 2, 1, −1, −2, −3, and −4, respectively.With the seat number as the index, the changes in students' action behavior are tracked, and the sequence of students' action behavior in the teaching time of knowledge points k is obtained, as shown in Table 5. Listening, writing, reading, looking around, using mobile phones, lying on the table, and talking are represented by 3, 2, 1, −1, −2, −3, and −4, respectively.From Table 5, it can be seen that analyzing the sequence of students' action behavior in the teaching time of knowledge point k can understand the behavior performance of students in class and help teachers to find students with negative behaviors, for example, if the student whose seat number is (5, 10) continues to use a mobile phone during the period and is in a state of wandering, the teacher should remind the student after class so that the student can listen to the lectures attentively during the class period and review what the teacher has said promptly after the class.

Conclusions
Recognizing and analyzing students' action behaviors is of great significance in the research of teaching feedback and improving students' learning effectiveness.In this paper, we propose a student action detection model based on an improved AIA network.To improve the detection accuracy of the model, we add an MsTA module to the fast pathway of the video backbone network, and an MsCSA module to the slow pathway, to efficiently extract the multi-scale temporal and spatial information.The EFL function is introduced to improve the category imbalance problem that exists in the action behavior dataset.The experimental results show that the improved AIA network in this paper can detect different action behaviors of students and has higher detection accuracy compared with the original network.In addition, correlating students' action behavior sequences with seat numbers as indexes can find the students who have negative action behaviors during the class period, which helps teachers to understand students' learning efficiency during the class period.In the future, data such as students' facial expressions will be further combined to jointly analyze students' learning emotions.Meanwhile, more data are collected to create a rich and diverse dataset.In addition, this paper plans to deploy the algorithm to embedded devices for use in smart classrooms to help teachers understand students' learning and improve their learning outcomes.

22 Figure 1 .
Figure 1.Overall system structure diagram of student action behavior recognition model.

M
and the MsTA module is added to the Res3, Res4, and Res5 stages of the fast pathway of the SlowFast network to help the network extract multi-scale spatial features and multi-scale temporal features.Finally, the AIA module in part c consists of the Asynchronous Memory Update (AMU) Algorithm and the Interaction Aggregation (IA) structure.The ROI Align algorithm [31] is used to extract the feature t P of the student bounding box and the feature t O of the object bounding box; store and read the memory feature t into the IAstructure to model multiple types of interaction relationships; and then classify the fused features to detect the student's action behavior.

)
Appl.Sci.2023, 13, x FOR PEER REVIEW 9 are two operations in the equation.First, the gather G operation adaptively aggregates the features of the global context; then, the distr F operation assigns the aggregated global context features back to each location i based on the local features i v and outputs i z .i z can be obtained by steps (1) and (2):

Figure 5 .
Figure 5. DA block.(1) Feature Gathering.The DA block uses bilinear pooling to capture the second-order statistics of features to generate a global representation.Bilinear pooling is the summation of the outer product of all feature vectors ( ) , ii ab in the two input feature maps
m×n of the bilinear pooling and rewriting the second feature B as B = b 1 , • • • , b n , where each b i is a thw dimensional row vector, we can reformulate Equation (5) as

Figure 8 .
Figure 8. Example of student action behavior in the classroom.

Figure 8 .
Figure 8. Example of student action behavior in the classroom.

4. 3 . 1 .
Experimental Results and Analysis of the Student Action Behavior DatasetIn the experiments on the student action behavior dataset, in this paper, the video frames are cropped to 640 640 

Figure 9 .
Figure 9. Accuracy and loss curves of the improved AIA network: (a) accuracy curve; (b) loss value curve.

Figure 9 .
Figure 9. Accuracy and loss curves of the improved AIA network: (a) accuracy curve; (b) loss value curve.

Figure 10 .
Figure 10.Comparison of average precision of seven action behavior categories.

Figure 11
Figure 11  shows an example of student action behavior detection at a certain moment in the classroom of the method proposed in this paper, which proves the effectiveness of the method.

Figure 10 .
Figure 10.Comparison of average precision of seven action behavior categories.

Figure 11
Figure 11  shows an example of student action behavior detection at a certain moment in the classroom of the method proposed in this paper, which proves the effectiveness of method.

Figure 11 .
Figure 11.Example of improved AIA network detection.

Figure 11 .
Figure 11.Example of improved AIA network detection.

Figure 13 .
Figure 13.mAP curves for different models.

2 .
The seat matching result at a certain point in time is shown in Figure14.The seat is denoted as ( ) , xy , where x stands for rows of seats and y stands for columns of seats.

Figure 14 .
Figure 14.Example of student seat matching.

Figure 13 .
Figure 13.mAP curves for different models.

Figure 13 .
Figure 13.mAP curves for different models.
2. The seat matching result at a certain point in time is shown in Figure 14.The seat is denoted as ( ) , xy , where x stands for rows of seats and y stands for columns of seats.

Figure 14 .
Figure 14.Example of student seat matching.

Figure 14 .
Figure 14.Example of student seat matching.

Table 3 .
Comparison of adding different modules and replacing loss functions.

Table 3 .
Comparison of adding different modules and replacing loss functions.

Table 5 .
The sequence of students' action behavior in the teaching time of knowledge point k.