Article

TBRNet: A Multi-Modal Network for Teacher Behavior Recognition with Cascaded Collaborative Attention and Dynamic Query-Driven

1 School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2 Artificial Intelligence and Intelligent Education Research Center, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 460; https://doi.org/10.3390/electronics15020460
Submission received: 18 December 2025 / Revised: 10 January 2026 / Accepted: 14 January 2026 / Published: 21 January 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

To address the challenges of high fine-grained similarity and background interference in recognizing teacher teaching behaviors (TTB), this paper proposes a multi-modal network, TBRNet, aiming to improve recognition performance and facilitate teaching reflection. TBRNet leverages CLIP as a semantic prior and introduces two key mechanisms: the Cascaded Collaborative Attention (CCA) module and the Dynamic Query-Driven (DQD) mechanism. The CCA module performs bidirectional fusion of temporal and semantic features to capture the temporal contextual information of teaching behaviors; the DQD mechanism, using gated semantic prototypes, adaptively focuses on key discriminative regions, improving the model’s ability to distinguish subtle behavioral differences. On the specialized TBU dataset, TBRNet outperforms all baseline models, achieving a Top-1 accuracy of 86.4%. On the public benchmarks UCF-101 and HMDB-51, TBRNet achieves remarkable accuracy rates of 95.8% and 81.7%, respectively, which validates its strong generalization capability across different datasets. This study provides an effective method for efficiently identifying teacher teaching behaviors in real classroom environments.

1. Introduction

The analysis of teacher teaching behavior (TTB) is crucial for improving teaching quality and advancing educational development [1]. TTB not only reflects teachers’ teaching strategies but also embodies their cognitive and emotional states during the teaching process [2,3,4]. Advances in intelligent educational technology have enabled TTB analysis to move beyond traditional manual observation (a method often limited by low efficiency and strong subjectivity) toward a more intelligent and refined approach [5].
From the perspective of data input, the recognition of TTB can be divided into two categories: single-modal and multi-modal. Single-modal methods primarily take static RGB images [6,7] or video sequences [8,9] as input. Those relying on static RGB images typically focus on posture analysis but struggle to capture fine-grained teaching behaviors in dynamic classrooms. In contrast, methods utilizing video sequences can better model real classroom scenarios through temporal modeling [10]. However, since they rely on a single source of information, both types of single-modal approaches lack the capability for complete semantic representation of teacher behaviors in complex environments, which ultimately limits their recognition performance. To address this limitation, multi-modal methods have been developed. These methods provide richer semantic representations by fusing multiple data sources such as RGB images, skeleton key points, video sequences, and text information. Representative fusion strategies include combining video and skeleton data [11,12], skeleton key points and single-frame images [13], images and emotional cues [14], as well as video clips and text information [1]. By integrating complementary cues, these multi-modal fusion strategies are more compatible with the dynamics and complexity of real classrooms and have achieved significant improvements in recognition accuracy [15,16].
Although multi-modal methods represent the current research trend, existing approaches still suffer from several limitations. First, they often lack robustness in complex classroom environments. Occlusion scenarios, caused by teacher-student interactions or instructional tools, pose a significant challenge to skeletal keypoint-based methods. These approaches are highly dependent on the accuracy of human pose estimation, which often fails under occlusion, resulting in the loss of critical posture features and ultimately leading to recognition failures. Second, the capability for temporal modeling remains insufficient. Existing approaches struggle to effectively capture the continuity and contextual correlation of teaching behaviors over time, making it difficult to distinguish fine-grained behaviors. Third, multi-modal fusion has room for improvement. Current multi-modal inputs mainly focus on skeleton points and visual information. The incorporation of textual modalities is generally limited to affective recognition, without achieving deep semantic alignment and collaborative modeling between textual and behavioral semantics.
To address these challenges, this paper proposes TBRNet for recognizing teacher teaching behaviors. The model leverages the cross-modal capabilities of Contrastive Language-Image Pretraining (CLIP) to incorporate category semantics and global visual semantics as prior knowledge, while adopting TimeSformer [17] as the backbone network for spatiotemporal feature extraction. We design cross-modal representations to achieve efficient alignment and deep fusion of visual motion features and textual semantics. To enhance the temporal representation of fine-grained teaching behaviors, a Cascaded Collaborative Attention (CCA) mechanism is designed. This mechanism employs hierarchical Multi-level Cross Attention (MCA) to achieve bidirectional fusion between raw temporal features and global visual semantic features, ensuring the alignment between behavioral dynamic features and teaching semantics. Meanwhile, a Dynamic Query-Driven (DQD) mechanism is introduced to refine the perception of key discriminative behavioral features. Guided by prototype semantics generated from fusing global visual semantics with category semantics, DQD incorporates a gating module to focus on key features in an input-adaptive manner. In summary, the contributions of this paper are as follows:
  • We propose TBRNet, a novel multi-modal network for teacher behavior recognition. TBRNet is designed to integrate temporal, visual, and textual semantic information. Through the design of mechanisms for visual-semantic alignment, cross-modal fusion, and prototype guidance, TBRNet effectively tackles the recognition challenges of TTB in real classrooms, including complex backgrounds, severe occlusions, and high behavior similarity.
  • A Cascaded Collaborative Attention (CCA) module is designed. CCA employs a hierarchical multi-level cross-attention architecture to progressively achieve bidirectional fusion between spatiotemporal behavioral features and global visual semantics, thereby improving the temporal modeling of continuous instructional behaviors.
  • A Dynamic Query-Driven (DQD) mechanism is introduced. Guided by semantic prototypes, DQD utilizes an adaptive gating module to filter discriminative behavioral features, thereby enhancing the perception and recognition of key teaching behaviors in complex scenarios.
  • Extensive experiments on a dedicated teacher behavior dataset (TBU) and two public benchmarks validate the effectiveness of TBRNet. The results show that our method significantly outperforms existing baselines in recognition accuracy and exhibits strong generalization ability, making it a robust solution for real classroom behavior analysis.
The remainder of this paper is organized as follows: Section 2 briefly reviews the related work. In Section 3, we introduce our proposed TBRNet in detail. Section 4 details the experimental setup and dataset, and presents a comprehensive analysis and discussion of the results. Finally, Section 5 concludes the proposed method and outlines future research directions.

2. Related Work

In this section, single-modal teacher teaching behavior recognition, multi-modal teacher teaching behavior recognition, and CLIP-based action recognition are analyzed in detail.

2.1. Single-Modal Teacher Teaching Behavior Recognition

With the integration of intelligent technology and education, the availability of classroom teaching behavior data has greatly increased. Due to its simplicity in data acquisition and relatively straightforward implementation, single-modal methods for recognizing TTB have emerged as a mainstream approach in early intelligent classroom analysis. Based on the type of input data, these methods can be broadly categorized into two types: image-based input and video-based input. Their specific characteristics and research advancements are summarized as follows.
The image-based input modality focuses on analysis of static spatial postures. Pang et al. [6] localized teacher regions through object detection, extracted skeletal keypoints, and then constructed neighborhood relationships via a graph convolutional network to recognize TTB. Meanwhile, Xu et al. [18] utilized a VGG16 network to extract deep features from single-frame images, incorporating a feature pyramid and convolutional block attention module to enhance representation for TTB classification. In general, while the image-based input modality can capture spatial pose information at specific moments, it fails to capture the temporal continuity of behavioral dynamics.
The video-based input modality focuses on dynamic temporal modeling. Zheng et al. [19] used video data as input, employing ResNet-50 to extract and fuse static features from multiple frames for teacher head pose recognition. However, this method lacked specialized temporal modeling modules, failing to capture inter-frame variations. Zhao et al. [8] proposed an improved behavior recognition network based on three-dimensional bilinear pooling (3D BP-TBR). This network relies on the accurate identification of the “teacher set” spatial region, which ties recognition performance to the teacher detection results and reduces adaptability in occluded scenarios. Yuvaraj et al. [9] designed a long-term recurrent convolutional network to capture spatiotemporal features for identifying teaching behaviors. Peng et al. [10] employed Faster R-CNN to localize the teacher region, adopted PoseC3D as the backbone network, and integrated a spatiotemporal attention mechanism with a dilated convolution module to better capture the spatiotemporal information of TTB. However, this approach overlooked the issue of feature loss caused by occlusions from students or teaching aids. Overall, the video-based input modality overcomes the limitations of static images by introducing temporal modeling. However, existing methods often rely on an idealized assumption of a clear, fixed, and unobstructed teacher view. This assumption differs from the complex real-world classroom environment, where teachers frequently move around and are easily obstructed.

2.2. Multi-Modal Teacher Teaching Behavior Recognition

In contrast to single-modal approaches, multi-modal research for TTB recognition primarily evolves along two directions. The first focuses on multi-modal behavior analysis for teaching evaluation, which aims to construct comprehensive evaluation systems by identifying multi-dimensional behaviors such as teacher emotions, gestures, and body orientation [1,20,21,22]. Studies in this category often employ conventional multi-class behavior recognition methods, prioritizing the implementation of evaluative functions over recognition accuracy. The second direction concentrates on multi-modal data fusion for enhanced recognition performance. It seeks to improve the accuracy and robustness of TTB recognition by integrating multi-source data (e.g., visual, skeletal, and auditory). Since the former direction is less relevant to the technical objectives of this study, this section focuses on reviewing the latter.
Multi-modal fusion for TTB can be categorized into two levels: intra-visual multi-modal fusion and cross-modal fusion. Intra-visual fusion integrates multi-dimensional visual cues from video data to construct comprehensive behavioral representations. Wu et al. [11] proposed a method combining RGB frames and skeletal data. Their approach used optical flow to extract hand-crafted motion features (e.g., HOG, HOF, MBH), which were then combined with skeletal joint features. However, this method relied on manual feature design and short-term temporal modeling, making it suitable only for short-term action recognition in scenarios with minimal background interference. Ma et al. [23] explored the multi-dimensional fusion of skeletal information by extracting joint coordinates, bone structures, and motion temporal features. They constructed a multi-stream graph convolutional network and incorporated temporal attention. Nevertheless, the absence of positional encoding in the attention mechanism limited its capacity to model the temporal sequential logic of teaching behaviors. Wu et al. [13] introduced a spatiotemporal dual-branch architecture that fused image-based spatial features with temporal human keypoint heatmaps. A notable limitation of this method is its performance degradation in occluded scenarios. Cross-modal fusion methods aim to integrate visual and non-visual modal data to improve the performance of TTB recognition. Although cross-modal fusion is crucial for TTB recognition [15], current research remains largely confined to teacher-student interaction scenarios, primarily relying on the integration of visual and auditory cues [24,25]. Research on more challenging fine-grained behaviors, such as interacting with students on the podium or walking around beneath it, is still limited and faces issues like poor adaptability to complex scenarios and insufficient information utilization.

2.3. CLIP-Based Action Recognition

Research on CLIP-based video action recognition focuses on two key areas: the interaction and alignment of multimodal information, and the enhancement of visual sequence modeling.
In the aspect of information interaction and cross-modal alignment, the research focus lies on integrating temporal, textual, and visual information of different granularities. Ni et al. [26] validated the transfer potential of CLIP to video tasks by aligning cross-frame fused features with category semantics. The BIKE model, proposed by Wu et al. [27], integrates video attributes, category semantics, and temporal features, achieving more robust recognition through dual textual semantic contrast. Meanwhile, Wang et al. [28] systematically constructed cross-modal alignment by designing independent prompts for vision and text. In enhancing visual sequence modeling, researchers aim to extract more discriminative spatiotemporal features to obtain finer-grained information for modality interaction. Wasim et al. [29] achieved fine-grained mining of temporal information and cross-modal matching by constructing global and local visual prompts. Wang et al. [30] designed dedicated spatiotemporal Transformer branches and attention modules to actively generate more discriminative video representations. Zhang et al. [31] enhanced feature discriminability by fusing multi-level visual information with textual semantics and leveraging the resulting representation as query vectors. To capture diverse temporal dynamics and spatial variations, Liu et al. [32] proposed an integrated Spatio-Temporal Prompts (STOP) model. The model comprised intra-frame spatial prompts and inter-frame temporal prompts, and their integration was designed to enhance the model’s dependency modeling across multiple timescales.
In summary, leveraging CLIP’s powerful vision-text alignment capability, significant progress has been made in fine-grained semantic modeling and cross-modal alignment for CLIP-based video action recognition. However, research remains relatively scarce for recognizing teacher instructional behaviors in classroom scenarios. This scenario presents multiple challenges, including complex occlusions and subtle actions, demanding models with more refined semantic understanding and dynamic collaborative capabilities.
To address this, this paper proposes the TBRNet model specifically for TTB recognition. While inheriting the advantages of the CLIP dual-encoder architecture and semantic guidance, TBRNet further introduces a Cascaded Collaborative Attention mechanism and a Dynamic Query-Driven mechanism. These designs aim to achieve fine-grained collaborative learning of multimodal semantics, providing a more effective solution for recognizing teaching behaviors in complex classroom scenarios.

3. Methodology

This section begins with a brief overview of our framework, followed by a detailed introduction to its three core modules: the video and text encoder, the cross-modal interaction module, and the dynamic query decoding module.

3.1. Overview of TBRNet

The proposed TBRNet is illustrated in Figure 1. It consists of three main components: (1) Video and text encoders. This component includes a semantic extraction module based on CLIP and a video encoder based on TimeSformer. The semantic module encodes video frames using the CLIP image encoder and processes category labels using the CLIP text encoder, obtaining global visual semantics and categorical textual semantics. The video encoder, built upon TimeSformer, models temporal information to extract discriminative spatiotemporal features. (2) Cross-modal interaction. This component consists of semantic prototype generation and a Cascaded Collaborative Attention (CCA) mechanism. The semantic prototype generation module constructs prototypes that align visual and textual semantics, serving as semantic anchors for cross-modal fusion. The CCA performs hierarchical cross-attention fusion between temporal behavioral features and global visual semantics, enhancing the representation of behavioral temporal context. (3) Dynamic query decoding. This component includes a Dynamic Query-Driven (DQD) module and a decoding classifier. Guided by semantic prototypes, DQD processes cross-modal fused features to generate dynamic query vectors that incorporate both video content and target category information. In the decoding stage, a Transformer decoder is employed to learn category-sensitive feature representations, with the dynamic query vectors acting as queries and the CCA module’s output serving as keys/values. Finally, a grouped linear classification layer is used to assign independent projection matrices for each category, outputting multi-label recognition results.

3.2. Video and Text Encoder

3.2.1. TimeSFormer-Based Video Encoder

The pre-trained TimeSformer is used as the visual backbone network to capture temporal contextual information. A video $V \in \mathbb{R}^{T \times 3 \times H \times W}$ is divided into $N$ non-overlapping patches of uniform size per frame. These patches are then linearly projected into embedding vectors and combined with learnable spatiotemporal positional encodings to form the initial input tokens $z_{(p,t)}^{(0)}$ for the transformer encoder, as defined by:
$$z_{(p,t)}^{(0)} = W_{(p,t)} \cdot x_{(p,t)} + e_{(p,t)}^{\mathrm{pos}} \in \mathbb{R}^{D}$$
where $x_{(p,t)}$ denotes the pixel values of the $p$-th patch in the $t$-th frame, $e_{(p,t)}^{\mathrm{pos}} \in \mathbb{R}^{D}$ represents the learnable spatiotemporal positional encoding, and $W_{(p,t)}$ is a learnable projection matrix.
The initial embedding vector is fed into the $L_v$-layer TimeSformer encoder. Each encoder layer contains a multi-head space-time attention (MSTA) block and a feed-forward network (FFN). The output of the $l$-th encoder layer is described as follows:
$$\tilde{Z}^{(l)} = \mathrm{MSTA}\big(\mathrm{LayerNorm}(Z^{(l-1)})\big) + Z^{(l-1)}$$
$$Z^{(l)} = \mathrm{FFN}\big(\mathrm{LayerNorm}(\tilde{Z}^{(l)})\big) + \tilde{Z}^{(l)}$$
where $Z^{(l)} \in \mathbb{R}^{(T \times N + 1) \times D}$.
From the final encoder layer, the class token $v_{\mathrm{cls}} = z_{(0,0)}^{(L_v)}$ and the patch features $z_{(p,t)}^{(L_v)}$ of each frame are extracted. They are then projected into a unified dimension $D$ using learnable matrices to produce the classification token $v_0 = W_{\mathrm{cls}} \cdot v_{\mathrm{cls}}$ and the frame features $v_t = W_v \cdot \frac{1}{N} \sum_{p=1}^{N} z_{(p,t)}^{(L_v)}$.
Subsequently, all frame features are concatenated with the classification token, and temporal position encodings are added to form the final temporal feature sequence X t . This process enables the model to effectively capture and represent the temporal structure of the input video.
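The projection-and-concatenation step above can be sketched in NumPy as follows. All dimensions and weight matrices are illustrative random stand-ins, not the authors' configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D_v, D = 8, 16, 96, 64   # frames, patches per frame, encoder dim, unified dim

# Hypothetical outputs of the last TimeSformer layer (random stand-ins):
v_cls = rng.standard_normal(D_v)             # class token z_(0,0)
patches = rng.standard_normal((T, N, D_v))   # per-frame patch features

# Learnable projection matrices (random stand-ins for W_cls and W_v)
W_cls = rng.standard_normal((D, D_v)) * 0.02
W_v = rng.standard_normal((D, D_v)) * 0.02

v0 = W_cls @ v_cls                       # projected classification token, (D,)
frames = patches.mean(axis=1) @ W_v.T    # mean-pool patches per frame, then project: (T, D)

# Concatenate the class token with frame features and add temporal positions
temporal_pos = rng.standard_normal((T + 1, D)) * 0.02
X_t = np.concatenate([v0[None, :], frames], axis=0) + temporal_pos
print(X_t.shape)  # (9, 64)
```

The resulting sequence has one token per frame plus the classification token, matching the $(T+1) \times D$ shape used by the CCA module later.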

3.2.2. CLIP-Based Visual Encoder

The CLIP image encoder is employed to extract high-level visual features that exhibit strong semantic consistency, benefiting from its contrastive pre-training paradigm. This encoder effectively preserves visual-semantic associations across diverse scenes [33], enhances representation accuracy in dynamic contexts, and provides rich global semantic information for subsequent action recognition tasks. The feature extraction process is defined as follows:
$$F_c^{i} = \mathrm{CLIP}_{\mathrm{visual}}(V[:, i]) \in \mathbb{R}^{D}, \quad i = 1, \dots, T$$
where $V[:, i]$ denotes the $i$-th frame of the input video.
Each frame is processed independently by the CLIP image encoder to obtain a high-level feature vector $F_c^{i} \in \mathbb{R}^{D}$. To aggregate these frame-wise features into a compact video-level representation, temporal average pooling followed by a dimension adaptation operation is applied. The formula is as follows:
$$X_c = \mathrm{Adapter}\Big(\frac{1}{T} \sum_{i=1}^{T} F_c^{i}\Big)$$
where Adapter denotes a linear layer or other lightweight projection network, and $X_c \in \mathbb{R}^{D}$ represents the resulting global visual feature vector of the video. By leveraging the CLIP image encoder, the model gains improved comprehension of holistic contextual information within the video.
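The pooling-plus-adapter step amounts to a temporal mean followed by a light projection; a minimal sketch, with per-frame CLIP features replaced by random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 8, 64
F_c = rng.standard_normal((T, D))          # per-frame CLIP features (random stand-ins)
W_a = rng.standard_normal((D, D)) * 0.02   # a single linear layer standing in for the Adapter

X_c = F_c.mean(axis=0) @ W_a.T             # temporal average pooling, then projection
print(X_c.shape)  # (64,)
```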

3.2.3. CLIP-Based Text Encoder

For each video behavior category, a descriptive text prompt such as “A video of a teacher [action]” is constructed, where [action] denotes the specific behavior category. These textual descriptions are fed into the CLIP text encoder to obtain the corresponding semantic embeddings $\{e_k\}_{k=1}^{K} \in \mathbb{R}^{K \times D}$, where $K$ denotes the total number of categories. The CLIP text encoder effectively maps text descriptions into a high-dimensional semantic space consistent with visual features, providing a robust semantic prior for cross-modal understanding and reasoning tasks.

3.3. Cross-Modal Interaction

3.3.1. Cascaded Collaborative Attention

In classroom teaching scenarios, the rich spatiotemporal dynamics inherent in TTB are crucial for addressing the core challenge of fine-grained recognition—namely, “large intra-class variations and small inter-class differences”. TTB exhibits strong temporal logic and context dependence. For instance, “lecture underneath the podium” and “walking around on the podium” are two distinct behaviors. Differentiating them requires not only perceiving the teacher’s location but also discerning subtle, category-specific postural cues such as gestures and gaze direction during explanation. Therefore, to accurately identify such fine-grained behaviors, the model must be able to understand the temporal evolution of the behavior and precisely associate it with specific teaching semantics.
To capture deep spatiotemporal context, this paper proposes a Cascaded Collaborative Attention (CCA) module, as illustrated in Figure 2. The core of CCA is to enable bidirectional, hierarchical interaction between temporal features and global semantic features. Specifically, in each layer, temporal features first act as queries, using global semantic features as context to refine themselves; subsequently, the updated semantic features, in turn, act as queries to extract the most relevant dynamic information from the temporal features. Through this alternating “dialogue”, the model can gradually establish fine-grained temporal-semantic alignment, thereby enhancing its understanding of behavioral context.
The CCA module takes temporal information X t and global visual semantics X c as dual inputs. It adopts a hierarchical architecture with L stacked identical layers, where each layer enables bidirectional cross-modal refinement through two core Multi-level Cross-Attention (MCA) operations. For the l-th layer, the process unfolds as follows:
(1) First MCA operation: temporal refinement. It takes the previous layer’s spatiotemporal features H t ( l 1 ) as Query (Q), and the previous layer’s semantic features H c ( l 1 ) as both Key (K) and Value (V). This allows temporal features to “attend to” relevant semantic cues, filtering and enhancing fine-grained temporal dynamics that align with global teaching semantics. The output is a preliminarily refined temporal feature H ˜ t ( l ) .
$$\tilde{H}_t^{(l)} = \mathrm{Softmax}\Big(\frac{H_t^{(l-1)} \cdot (H_c^{(l-1)})^{T}}{\sqrt{d_k}}\Big) H_c^{(l-1)}$$
where $d_k$ denotes the feature dimension of the Query and Key, used to scale the attention scores.
(2) Second MCA operation: semantic update. It reverses the interaction flow: the previous layer’s semantic features H c ( l 1 ) serve as Query (Q), while the refined temporal feature H ˜ t ( l ) (output of the first MCA) acts as both Key (K) and Value (V). This injects the updated temporal dynamics back into the semantic space, refining abstract semantics into more behavior-specific representations. The output is a preliminarily updated semantic feature H ˜ c ( l ) .
$$\tilde{H}_c^{(l)} = \mathrm{Softmax}\Big(\frac{H_c^{(l-1)} \cdot (\tilde{H}_t^{(l)})^{T}}{\sqrt{d_k}}\Big) \tilde{H}_t^{(l)}$$
Finally, the outputs of layer $L$ are $H_t^{(L)}$ and $H_c^{(L)}$.
Through stacked MCA layers, CCA enables bidirectional fusion between global semantics and spatiotemporal features, enhancing sensitivity to subtle behavioral dynamics. After L layers, a gated fusion mechanism aggregates the top-level representations. The formula is as follows:
$$g = \sigma\big(W_g [H_t^{(L)}; H_c^{(L)}] + b_g\big)$$
where $\sigma$ is the sigmoid activation function, $[\,;\,]$ denotes vector concatenation, and $W_g$ and $b_g$ are learnable parameters.
The fused feature representation $H_{\mathrm{fuse}}$ is obtained by a weighted combination using the gating weights. The final output feature $X_{\mathrm{final}}$ is constructed by concatenating the fused feature $H_{\mathrm{fuse}}$ with the two original high-level features $H_t^{(L)}$ and $H_c^{(L)}$, followed by a linear projection. The formulas are as follows:
$$H_{\mathrm{fuse}} = g \cdot H_t^{(L)} + (1 - g) \cdot H_c^{(L)}$$
$$X_{\mathrm{final}} = W_{\mathrm{out}} [H_{\mathrm{fuse}}; H_t^{(L)}; H_c^{(L)}]$$
where $H_{\mathrm{fuse}} \in \mathbb{R}^{(T+1) \times D}$, $X_{\mathrm{final}} \in \mathbb{R}^{(T+1) \times D}$, and $W_{\mathrm{out}}$ is a learnable projection matrix.
This design preserves fine-grained details while enhancing discriminative power, providing robust spatiotemporal semantics for subsequent recognition.
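The two MCA operations and the gated fusion can be sketched as follows. This is a minimal single-head version: learned Q/K/V projections, multi-head splitting, and residual connections are omitted, and all weights are random stand-ins:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, KV, d_k):
    # Single-head scaled dot-product cross-attention (projections omitted for brevity)
    return softmax(Q @ KV.T / np.sqrt(d_k), axis=-1) @ KV

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
T1, D, L = 9, 64, 2                  # T+1 temporal tokens, feature dim, CCA depth
H_t = rng.standard_normal((T1, D))   # temporal features X_t
H_c = rng.standard_normal((1, D))    # global visual semantics X_c (one token)

for _ in range(L):
    H_t = cross_attention(H_t, H_c, D)   # first MCA: temporal refinement
    H_c = cross_attention(H_c, H_t, D)   # second MCA: semantic update

# Gated fusion of the top-level representations
H_c_b = np.broadcast_to(H_c, H_t.shape)              # broadcast semantics to each token
W_g = rng.standard_normal((2 * D, D)) * 0.02
g = sigmoid(np.concatenate([H_t, H_c_b], axis=-1) @ W_g)
H_fuse = g * H_t + (1 - g) * H_c_b
W_out = rng.standard_normal((3 * D, D)) * 0.02
X_final = np.concatenate([H_fuse, H_t, H_c_b], axis=-1) @ W_out
print(X_final.shape)  # (9, 64)
```

The gate $g$ is computed per token and per channel here; the element-wise convex combination keeps $H_{\mathrm{fuse}}$ between the two input streams.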

3.3.2. Semantic Alignment Prototype

This module leverages the pre-trained CLIP model as a semantic prior to generate feature-level semantic prototypes. These prototypes capture semantic correlations between visual features and textual embeddings, guiding subsequent dynamic queries to focus on discriminative behavior-specific features. The generation process consists of two steps.
(1) Weighted fusion based on semantic relevance. The semantic relevance between global visual features and all category text embeddings is computed. The normalized relevance scores form a weight distribution, which is used to produce a weighted sum of category embeddings, yielding a text-informed semantic vector.
$$P_{\mathrm{text}} = \sum_{k=1}^{K} \mathrm{Softmax}\Big(\frac{X_c \cdot e_k}{\sqrt{D}}\Big) \cdot e_k$$
where $K$ is the number of categories, $D$ is the feature dimension, $X_c$ is the global visual feature, and $e_k$ is the $k$-th category text embedding.
(2) Multi-modal feature alignment and fusion. The text semantic vector P text is projected to the visual feature space and combined with the global visual feature via residual connection, producing the final semantic prototype:
$$P = \mathrm{LayerNorm}(X_c + W_p \cdot P_{\mathrm{text}})$$
where $W_p$ is a learnable projection matrix, and $P \in \mathbb{R}^{D}$ is the resulting multi-modal semantic prototype.
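The two prototype-generation steps reduce to a relevance-weighted sum over text embeddings followed by a residual projection; a minimal sketch with random stand-in features:

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(3)
K, D = 13, 64
X_c = rng.standard_normal(D)         # global visual feature (stand-in)
E = rng.standard_normal((K, D))      # category text embeddings e_1..e_K (stand-ins)

w = softmax(E @ X_c / np.sqrt(D))    # step 1: normalized relevance weights over categories
P_text = w @ E                       # weighted sum of category embeddings

W_p = rng.standard_normal((D, D)) * 0.02
P = layer_norm(X_c + W_p @ P_text)   # step 2: residual fusion -> semantic prototype
print(P.shape)  # (64,)
```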

3.4. Dynamic Query Decoding

3.4.1. Dynamic Query Generation

To enhance the recognition of key teaching behaviors, a Dynamic Query-Driven (DQD) mechanism is introduced. The core idea is to leverage the powerful semantic priors of the CLIP model to generate highly discriminative, input-adaptive query vectors. To enable adaptive feature selection, the DQD module first fuses the global visual information of the video with semantic prototype information to generate a base query vector for each category of teacher behaviors. A lightweight gating network then dynamically assesses the relevance of each category to the current input video, producing a set of gating coefficients. These coefficients scale the base queries, effectively amplifying those for highly relevant categories while suppressing others. The resulting dynamic queries guide the subsequent Transformer decoder to attend specifically to the spatiotemporal regions most indicative of the pertinent behaviors, allowing the model to filter out irrelevant information and focus on the most discriminative features in complex scenes. The process consists of two main steps.
(1) Generation of base query vectors. Global semantics are extracted from the CCA output features via temporal average pooling and fused with the feature semantic prototype P to produce a preliminary base query vector for each category:
$$q_k^{\mathrm{base}} = \mathrm{LayerNorm}\big(W_q (V_g \oplus P) + b_q\big)$$
where $V_g = \mathrm{Pool}(X_{\mathrm{final}}[:, 0, :]) \in \mathbb{R}^{D}$ represents the global visual feature, $P \in \mathbb{R}^{D}$ is the semantic prototype, $K$ is the total number of categories, $W_q$ and $b_q$ are learnable parameters, $\oplus$ denotes vector concatenation, and $q_k^{\mathrm{base}} \in \mathbb{R}^{D}$ is the base query vector for the $k$-th category.
(2) Dynamic feature selection via a gating mechanism. The global visual feature $V_g$ is used to generate a gating vector $g$ that estimates the relevance of each category to the current input:
$$g = \sigma(W_v V_g + b_v)$$
where $W_v \in \mathbb{R}^{K \times D}$ and $b_v \in \mathbb{R}^{K}$ are the learnable weight and bias parameters of the gating layer, $\sigma$ denotes the sigmoid function, $g \in \mathbb{R}^{K}$ is the gating vector, $g_k \in (0, 1)$ is its $k$-th entry, and $q_k = g_k \cdot q_k^{\mathrm{base}}$ is the gated query vector for the $k$-th category. The final dynamic query matrix is formed by stacking the $K$ gated query vectors: $Q = [q_1, q_2, \dots, q_K]^{T} \in \mathbb{R}^{K \times D}$.
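The two steps can be sketched as follows. Note one assumption: the base-query formula is written with a shared projection, so a per-category projection $W_q^{(k)}$ is assumed here to make the base queries differ across categories; all weights are random stand-ins:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(4)
K, D = 13, 64
V_g = rng.standard_normal(D)   # pooled global visual feature (stand-in)
P = rng.standard_normal(D)     # semantic prototype (stand-in)

# Step 1: per-category base queries from the concatenated (V_g, P) vector
W_q = rng.standard_normal((K, D, 2 * D)) * 0.02   # one projection per category (assumed)
b_q = np.zeros((K, D))
vp = np.concatenate([V_g, P])
Q_base = layer_norm(np.einsum('kde,e->kd', W_q, vp) + b_q)   # (K, D)

# Step 2: input-adaptive gating over categories
W_v = rng.standard_normal((K, D)) * 0.02
b_v = np.zeros(K)
g = sigmoid(W_v @ V_g + b_v)          # relevance gate, each entry in (0, 1)
Q = g[:, None] * Q_base               # gated dynamic query matrix, (K, D)
print(Q.shape)  # (13, 64)
```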

3.4.2. Decoding and Classification

The generated dynamic query matrix $Q$ is input into a Transformer decoder, while the CCA output features $X_{\mathrm{final}}$ serve as Key (K) and Value (V). Within the decoder, the queries interact with $X_{\mathrm{final}}$ via cross-attention to capture rich spatiotemporal context. The decoder adopts a standard 3-layer Transformer architecture, outputting the final result $H_D$:
$$H_D = \mathrm{TransformerDecoder}(Q, X_{\mathrm{final}}, X_{\mathrm{final}}) \in \mathbb{R}^{K \times D}$$
where $H_D \in \mathbb{R}^{K \times D}$ contains $K$ feature vectors, each enriched with visual context and guided by semantic prototypes.
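A stripped-down sketch of the decoding step, keeping only the cross-attention path (self-attention, FFN blocks, and learned projections of a full Transformer decoder are omitted; all tensors are random stand-ins):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
K, T1, D = 13, 9, 64
Q = rng.standard_normal((K, D))    # dynamic query matrix (stand-in)
M = rng.standard_normal((T1, D))   # memory tokens serving as keys/values (stand-in)

H_D = Q
for _ in range(3):                 # three decoder layers (cross-attention only)
    H_D = H_D + softmax(H_D @ M.T / np.sqrt(D), axis=-1) @ M
print(H_D.shape)  # (13, 64)
```

Each of the $K$ query rows attends over the memory tokens independently, which is why the output keeps one feature vector per category.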
To prevent parameter interference across different categories, a grouped linear classification layer is employed. Each enhanced query vector h k in H D corresponding to category k is processed by an independent classification head to compute its confidence score.
$$\hat{y}_k = \phi\big(W_k^{T} \cdot h_k + b_k\big), \quad k = 1, \dots, K$$
where $\phi$ denotes a nonlinear activation function, and $W_k \in \mathbb{R}^{D}$ and $b_k \in \mathbb{R}$ are the learnable parameters for the $k$-th category. The final output is the prediction vector $\hat{y} = [\hat{y}_1, \dots, \hat{y}_K]$ over all $K$ categories.
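The grouped heads amount to one independent dot product per category; a minimal sketch, using ReLU as a stand-in for the unspecified activation $\phi$ and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(6)
K, D = 13, 64
H_D = rng.standard_normal((K, D))        # decoder output: one vector per category
W = rng.standard_normal((K, D)) * 0.02   # independent head weights W_k (no sharing)
b = np.zeros(K)

relu = lambda x: np.maximum(x, 0.0)      # stand-in for the activation phi
scores = relu(np.einsum('kd,kd->k', W, H_D) + b)   # y_k = phi(W_k^T h_k + b_k)
print(scores.shape)  # (13,)
```

Because row $k$ of `W` only ever multiplies row $k$ of `H_D`, gradients for one category's head never touch another's, which is the stated motivation for the grouped layout.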

3.5. The Loss

The Binary Cross-Entropy (BCE) loss is adopted as the optimization objective for video action classification. This loss measures the discrepancy between predicted probabilities and ground-truth labels, guiding the model toward more accurate classification boundaries. The ground-truth labels for input videos are provided in multi-hot encoding format; for the TBU dataset, which is single-label (each video clip corresponds to exactly one fine-grained teaching behavior category), this reduces to one-hot encoding. The model outputs a logit for each class, and each logit is independently transformed into a probability estimate using the sigmoid function. The BCE loss is then computed between these probabilities and the ground truth.
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{K} \sum_{k=1}^{K} \left[ y_k \log \sigma(\hat{y}_k) + (1 - y_k) \log\big(1 - \sigma(\hat{y}_k)\big) \right]$$
where $\sigma$ denotes the sigmoid function and $y = [y_1, y_2, \dots, y_K] \in \{0,1\}^{K}$ is the ground-truth label vector.
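In PyTorch this objective corresponds to the numerically stable `binary_cross_entropy_with_logits`, which fuses the sigmoid and the BCE term and averages over the $K$ classes by default. A toy example:

```python
import torch
import torch.nn.functional as F

# BCE objective as described: sigmoid over per-class logits, averaged
# over the K classes; labels are one-hot in the single-label TBU setting.
K = 13
logits = torch.randn(K)                    # model outputs y_hat_k
target = torch.zeros(K); target[3] = 1.0   # one-hot ground truth

loss = F.binary_cross_entropy_with_logits(logits, target)  # mean over K
```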

4. Results and Analysis

4.1. Experimental Protocols

4.1.1. Datasets and Evaluation

(1) Datasets
The proposed method is evaluated on three datasets: TBU [34], UCF-101 [35] and HMDB-51 [36]. A subset of the TBU dataset [34] is utilized in our experiments. The full dataset is collected from 200 real classroom teaching videos across four educational stages, captured from diverse perspectives and with various recording devices, reflecting authentic classroom scenarios. It contains 28.6 K annotated action clips, each lasting 2–90 s, covering 13 fine-grained teaching behavior categories. In this work, we employ a subset of this data, split into training and test sets containing 17,762 and 728 clips, respectively, facilitating effective model training and generalization assessment. Figure 3 illustrates examples from the TBU dataset. The TBU dataset presents two major challenges for TTB recognition in real classroom scenarios. First, diverse camera angles and complex classroom environments lead to significant occlusion of the teacher. Second, the behaviors exhibit fine-grained similarity, characterized by high intra-class variation and low inter-class variation. These challenges demand a model with robust spatiotemporal reasoning capability and high sensitivity to discriminative features.
UCF-101 [35] is a widely adopted benchmark for action recognition, consisting of 13,320 real-world videos from 101 fine-grained action categories. It exhibits substantial variations in motion, viewpoint, illumination, and camera conditions. Each category includes at least 100 videos, with an average duration of 7.2 s. The dataset provides three official training/testing splits, totaling approximately 9.5 K training and 3.7 K test samples.
HMDB-51 [36] comprises 6766 video samples across 51 categories. The videos are sourced from diverse unstructured recordings and exhibit challenging conditions such as camera motion, viewpoint changes, cluttered backgrounds, and partial occlusions, closely mirroring real-world complexity. The dataset is officially divided into three splits, each providing a pair of training and validation sets.
(2) Evaluation
This paper adopts several widely used evaluation metrics for action recognition, including class-wise Top-1 accuracy, mean accuracy, and macro-average F1-score. Class-wise Top-1 accuracy evaluates the recognition precision for individual action categories. Mean accuracy is the arithmetic mean of the per-class Top-1 accuracies. The macro F1-score is the unweighted mean of the per-class F1-scores, each of which is the harmonic mean of precision and recall; it provides a balanced assessment across all classes, which is particularly useful under class imbalance. The formal definitions of these metrics are provided below:
$$\mathrm{Acc}_k = \frac{N_{\mathrm{correct}}^{k}}{N_{\mathrm{total}}^{k}} \times 100\%$$
$$\mathrm{Acc}_{\mathrm{mean}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Acc}_k$$
$$\mathrm{Precision}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}, \qquad \mathrm{Recall}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$$
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{2 \cdot \mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}$$
where $\mathrm{TP}_k$, $\mathrm{FP}_k$, and $\mathrm{FN}_k$ denote the true positives, false positives, and false negatives for class $k$, respectively; $K$ is the total number of classes; $N_{\mathrm{correct}}^{k}$ is the number of correctly classified samples in class $k$; and $N_{\mathrm{total}}^{k}$ is the total number of samples in class $k$.
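The metric definitions above can be verified on a toy example. The sketch below computes per-class accuracy, mean accuracy, and macro F1 from scratch for three classes; the labels are fabricated purely for illustration.

```python
import numpy as np

# Toy evaluation: per-class Top-1 accuracy, mean accuracy, and macro F1.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
K = 3

# Per-class Top-1 accuracy and its arithmetic mean.
acc_k = [np.mean(y_pred[y_true == k] == k) for k in range(K)]
acc_mean = np.mean(acc_k)

# Macro F1: unweighted mean of per-class F1-scores.
f1_k = []
for k in range(K):
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    fn = np.sum((y_pred != k) & (y_true == k))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1_k.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
macro_f1 = np.mean(f1_k)
```

In practice the same numbers can be obtained with `sklearn.metrics.f1_score(..., average="macro")`.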

4.1.2. Comparison Models and Implementation Details

(1) Comparison models
To validate the effectiveness of the proposed model, we performed comparative experiments with several representative methods. C3D [37] is a 3D CNN-based video action recognition model that extracts spatiotemporal features directly via 3D convolutional kernels. TSN [38] (Temporal Segment Network) improves recognition efficiency and robustness by sparsely sampling video segments and aggregating predictions from multiple segments. I3D [36] adopts a two-stream inflated 3D CNN architecture, integrating RGB appearance information and optical flow motion information to achieve joint spatiotemporal modeling. SlowFast [39] is a dual-path spatiotemporal modeling framework: its Slow pathway captures spatial semantics with high resolution, while the Fast pathway captures motion information at low resolution. SoSR+ToSR [40] integrates spatial-oriented super-resolution (SoSR) and temporal-oriented super-resolution (ToSR) techniques to enhance video clarity, thereby supporting more accurate action recognition. SVT [41] is a self-supervised video Transformer that learns general-purpose video representations through pretext tasks without relying on manual annotations. MSM-ResNet [42] is a multi-stream ResNet architecture, incorporating appearance, motion, and motion saliency branches to realize complementary feature learning. TimeSformer [17] is a Transformer-based video recognition model that captures long-range spatiotemporal dependencies via divided space-time attention. BIKE [27] is a CLIP-based approach, achieving cross-modal action recognition through video-attribute association and temporal concept localization. MCB [30] uses spatial and temporal Transformers for feature extraction and fusion, incorporates CLIP text semantics, and adopts a dual-attention mechanism to enable collaborative learning and classification. 
STOP [32] improves the perception of diverse spatiotemporal variations within temporal segments by combining intra-frame attention-based spatial prompts and frame-similarity-based dynamic temporal prompts.
(2) Implementation details
The experiments were implemented using PyTorch 2.1.0 under Python 3.8. The Adam optimizer was used for fine-tuning with a base learning rate of $5 \times 10^{-5}$ and L2 weight decay of $1 \times 10^{-4}$ to improve generalization. The learning rate was scheduled using a cosine annealing warm-restart strategy, with an initial cycle length of 10 epochs and a cycle multiplier of 2. Model training was performed on 4 NVIDIA GeForce RTX 3090 Ti GPUs with a batch size of 32 for 100 epochs. The CCA module consists of 4 layers of multi-head attention with 8 attention heads each. The feature output dimension of both the CLIP visual encoder and the TimeSformer network is 1024, while the output dimension of the CLIP text encoder is 512. The complete TBRNet architecture contains approximately 140 million parameters. An inference-time estimate was derived from the model architecture: on a single NVIDIA GeForce RTX 3090 Ti GPU with a batch size of 1, processing a 16-frame video clip takes approximately 80–120 ms, corresponding to roughly 8–12 FPS.
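The reported optimization setup maps directly onto PyTorch's built-in scheduler. The following sketch wires up the stated hyperparameters (Adam, lr $5 \times 10^{-5}$, weight decay $1 \times 10^{-4}$, warm restarts with $T_0 = 10$, $T_{\mathrm{mult}} = 2$); the tiny linear model is only a stand-in for TBRNet.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Optimizer/scheduler configuration as reported in the paper; the model
# here is a placeholder, not the actual TBRNet architecture.
model = nn.Linear(512, 13)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-4)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(100):
    optimizer.step()     # placeholder for the per-batch updates
    scheduler.step()     # restart cycles at epochs 10, 30, 70, ...
```

With $T_0 = 10$ and $T_{\mathrm{mult}} = 2$, the cosine cycles span 10, 20, and 40 epochs, so the 100-epoch run ends partway through the fourth cycle.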
For fair architectural comparison, we re-trained the most comparable models (TimeSformer, TSN, BIKE and STOP) on our TBU dataset under identical conditions: same data splits, optimizer settings, batch size, epoch count, and a unified segment-based sampling strategy (4 segments × 4 frames = 16 frames). The results for other classical baseline models (e.g., C3D, I3D) are reported as per their original publications to maintain consistency with widely established benchmarks.
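The unified segment-based sampling (4 segments × 4 frames = 16 frames) can be sketched as follows. This is one plausible reading of the strategy, drawing 4 consecutive frames from a random start inside each of 4 equal segments; the function name and the consecutive-frame choice are our assumptions.

```python
import numpy as np

# Hypothetical sketch of 4-segment x 4-frame sampling: split the clip into
# equal segments and draw a short consecutive run from each.
def sample_frames(num_frames, segments=4, frames_per_segment=4, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    seg_len = num_frames // segments
    idx = []
    for s in range(segments):
        # Random start within the segment, leaving room for the run.
        start = s * seg_len + rng.integers(0, max(1, seg_len - frames_per_segment + 1))
        idx.extend(range(start, start + frames_per_segment))
    return idx

frames = sample_frames(120)   # e.g. a 120-frame clip -> 16 frame indices
```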

4.2. Comparison with Previous Works

4.2.1. Comparison of Different Models on TBU

Table 1 shows the recognition performance of different models on the TBU dataset. The proposed TBRNet model achieves the best results across all metrics, with a Top-1 accuracy of 86.4%, a mean accuracy of 82.1%, and an F1-score of 81.8%, demonstrating its effectiveness in handling complex classroom scenarios.
Among the comparison methods, conventional 3D CNNs such as C3D and I3D show limited performance in mean accuracy and F1-score, suggesting their inadequacy in capturing fine-grained temporal dynamics in complex environments. While the SlowFast network and TSN bring moderate improvements through dual-path modeling and temporal segmentation, their gains remain constrained. In contrast, the TimeSformer model, leveraging spatiotemporal self-attention, achieves notable performance (83.4% Top-1 accuracy, 74.3% F1-score), underscoring the advantage of modeling long-range dependencies. The BIKE model, via cross-modal fusion, achieves a comparable 83.3% Top-1 accuracy together with higher mean accuracy and F1-score, which highlights the benefit of integrating complementary video and text cues. However, the performance of the BIKE model is still lower than that of STOP and TBRNet. This may be because BIKE relies on global similarity calculations and clustering, making it difficult to capture sufficient discriminative features, especially when dealing with teaching behaviors that have complex backgrounds, subtle actions, and long-term dependencies. STOP achieves better performance than BIKE by designing inter-frame temporal prompts and intra-frame spatial prompts. However, its results still fall short of TBRNet. This may be because, in complex classroom scenarios, STOP’s fine-grained spatial and temporal prompts introduce additional redundant information.
TBRNet achieves the best performance on the TBU dataset, primarily due to two key design aspects. First, the CCA module enables bidirectional and progressive fusion between stable global visual semantics (from the CLIP image encoder) and dynamic local spatiotemporal features (from TimeSformer). This design allows the two types of information to serve as mutual context, enhancing the model’s understanding of key temporal evolution in complex scenes. Second, the DQD module uses prototype semantics constructed from global visual semantics and category text semantics as its core guidance, adaptively filtering and enhancing the fused features output by CCA, thereby precisely focusing on the most discriminative behavioral patterns. Overall, the results validate that TBRNet, by incorporating fine-grained semantic guidance and cross-modal interaction, offers a more robust solution for teacher behavior recognition under challenging conditions.

4.2.2. Comparison of Different Models on UCF101 and HMDB51

Table 2 presents a comparative analysis of Top-1 accuracy on the UCF101 and HMDB51 datasets. Experimental results indicate that models incorporating attention mechanisms and temporal modeling consistently outperform traditional methods. The conventional 3D convolutional model C3D achieves a baseline accuracy of 90.1%. The I3D model, which enhances temporal modeling through inflated 3D convolutions, improves performance to 95.1%. Among two-stream architectures, TSN achieves 94.2% accuracy using segment-based sampling, while SoSR+ToSR performs slightly worse due to its focus on super-resolving low-resolution inputs. Attention-based models exhibit particularly outstanding performance. SVT attains 93.7% accuracy through global spatiotemporal attention mechanisms. MCB achieves the highest accuracy on UCF101 at 96.3%, which benefits from its semantic matching attention and semantic allocation attention modules. STOP underperforms MCB on both datasets, likely because it focuses more on action segments with significant spatiotemporal variations. The proposed TBRNet demonstrates competitive performance, achieving accuracy comparable to MCB on UCF101 and state-of-the-art results on HMDB51. These results demonstrate the strong generalization capability and applicability of TBRNet in broader action recognition scenarios.

4.2.3. Evaluation of Recognition Accuracy of Each Category

Table 3 presents the recognition accuracy for each category on the TBU dataset, with the corresponding confusion matrix visualized in Figure 4. As can be seen, the proposed model achieves excellent performance on most categories, though several exhibit comparatively lower recognition rates.
Among these, the “operate multimedia” category shows notably lower accuracy. Visually, this behavior typically involves subtle, low-salience motions and exhibits high pattern overlap with “multimedia teaching”. Even when differential gestures exist, their feature distinctions remain limited. Additionally, this behavior often co-occurs spatially and temporally with “lecture on the podium”, further complicating discrimination. As a result, the model frequently misclassifies “operate multimedia” as either “multimedia teaching” or “lecture on the podium”. These results indicate that semantic overlap between different teaching behaviors is characteristic of real classroom environments [34]. Similarly, the “walking around on the podium” category is prone to confusion with “lecture on the podium”, as their frame-level visual features are highly similar. The key to distinguishing these two behaviors lies in whether they are accompanied by lecturing, which requires audio modality information, making reliable classification difficult based solely on visual features.
The in-depth analysis of the challenging cases mentioned above also reveals two core challenges to the robustness of the TBRNet model: (1) extremely high inter-class visual similarity with a scarcity of discriminative features; (2) insufficient integration of multi-modal cues, where critical disambiguating information lies beyond the visual domain. These are not simply failures, but rather characterizations of the model’s failure modes and the boundaries of its visual robustness. Although the proposed architecture enhances fine-grained recognition capability, achieving full robustness in real-world classroom scenarios will require further research efforts, such as the integration of complementary modalities and the design of more sophisticated temporal reasoning mechanisms.
Despite these complexities inherent in real classroom settings, the proposed model maintains strong recognition performance on other fine-grained behaviors. Moreover, TBRNet achieves competitive results on public benchmarks, demonstrating its robustness and generalization capability across diverse scenarios.

4.3. Ablation Study

4.3.1. Ablation Study of Different Modules

To evaluate the contribution of each component in the TBRNet model, we conducted a series of ablation experiments, with results summarized in Table 4. The pre-trained TimeSformer (TS) was used as the baseline, and the Cascaded Collaborative Attention (CCA) module and the Dynamic Query-Driven (DQD) module were incrementally incorporated.
The results indicate that the baseline model solely using TS achieved an accuracy of 83.42%. Introducing the CCA module alone improved accuracy by 2.53%. This performance gain is mainly attributable to the bidirectional temporal context modeling mechanism of the CCA. Through the interaction between temporal features and global semantics, it explicitly establishes the continuity and logical dependency relationships of teaching behaviors (e.g., distinguishing between lecture on the podium and walking around on the podium), thereby addressing the critical issue that pure temporal models struggle to effectively capture behavioral contextual correlations. Introducing the DQD module alone improved accuracy by 1.34%. The core contribution of this module lies in its precise adaptive feature selection capability. The DQD module dynamically focuses on key discriminative regions via semantic gating (e.g., teachers’ head and hand poses in the lecture on the podium behavior). It not only effectively filters out redundant information in complex backgrounds but also markedly boosts the model’s sensitivity to fine-grained behavioral differences.
The combined use of CCA and DQD achieves the best accuracy of 86.40%. This result clearly demonstrates that the integration of these two modules produces a significant synergistic effect: the context-aware feature fusion capability of CCA and the semantic-guided feature discrimination capability of DQD complement each other precisely, jointly constructing a more efficient framework for teacher behavior recognition. It further verifies the core complementarity between the two modules—CCA deciphers the temporal patterns of behavioral evolution, while DQD localizes the core regions of key discriminative evidence. The two modules complement each other and jointly constitute a robust framework that integrates both behavioral understanding and feature discrimination capabilities.

4.3.2. Ablation Study on Different Modal Inputs

Ablation studies using different combinations of multi-modal features were conducted to further evaluate the contribution of each modal input, as shown in Table 5. Spatiotemporal representations (ST-R) were extracted using the TimeSformer network, while global visual semantics (GV-S) and textual category semantics (TC-S) were obtained from the CLIP visual and text encoders, respectively. The results show that incorporating GV-S alone increased the accuracy to 84.69%, demonstrating strong complementarity between global visual semantics and spatiotemporal features. Most notably, introducing TC-S as semantic prior knowledge led to a substantial performance improvement, raising accuracy to 85.78%. This gain significantly exceeded that from visual semantics alone, highlighting the critical role of textual semantics in fine-grained action discrimination. Ultimately, the integration of both GV-S and TC-S achieved the optimal accuracy of 86.40%, confirming that spatiotemporal features, visual semantics, and textual priors collectively form a more discriminative multi-modal representation with synergistic enhancement effects.

5. Conclusions and Future Work

To address the recognition challenges of TTB in complex real-world classroom environments, this paper proposes TBRNet, a novel multi-modal behavior recognition network. Based on CLIP and TimeSformer, TBRNet constructs a cross-modal representation space to enhance visual-text semantic alignment. A Cascaded Collaborative Attention (CCA) mechanism is designed to achieve bidirectional fusion of temporal behavioral features and visual semantic information. Furthermore, a Dynamic Query-Driven (DQD) module is introduced, which adaptively focuses on discriminative features under the guidance of semantic prototypes, thereby significantly improving recognition robustness in complex scenarios. Experiments on the dedicated TBU dataset and two public benchmarks demonstrate that TBRNet outperforms existing methods in recognition accuracy and exhibits strong generalization capability.
From the technical implementation perspective, the method proposed in this paper aligns with mainstream cross-modal learning by building upon CLIP with domain adaptation. It should be acknowledged that directly applying CLIP, which is pre-trained on general web data, to the educational domain may result in a semantic domain gap. The generic visual-text associations CLIP learns may not precisely align with the fine-grained, pedagogy-specific semantics of classroom behaviors. This semantic mismatch problem will directly restrict the matching degree between the prior knowledge of the pre-trained model and the refined behavioral definitions required by this task.
To address these limitations, future work will focus on four directions: (1) Conduct research on the construction of education-specific datasets and domain-adaptive prompt tuning. High-quality classroom video-text datasets will be curated, and domain-adaptive prompt tuning will be performed on CLIP-based models to enhance semantic alignment and bridge the domain gap. (2) Conduct research on multi-modal data fusion and cross-modal complementary strategies. The current visual-text framework will be extended to incorporate audio and skeletal modalities. Multi-granularity prompting strategies will be integrated to achieve complementary cross-modal learning, overcoming core limitations of vision-only models. (3) Conduct research on the innovation of few-shot and zero-shot learning paradigms. A dynamically adaptive model architecture will be designed to alleviate the problem of insufficient labeled data for rare behavior categories in real classroom environments. Meanwhile, rigorous comparative evaluations will be conducted between the proposed model and mainstream general-purpose video models (e.g., VideoCLIP and X-CLIP) to ensure its reliability and superiority in practical classroom applications. (4) Conduct research on the construction and practical application of an end-to-end intelligent teaching feedback system. The behavior recognition results of TBRNet will be deeply integrated with multi-dimensional teaching evaluation indicators to build an integrated intelligent teaching feedback system. Ultimately, this system will provide teachers and students with feasible, personalized teaching optimization guidelines, promoting the transformation of research results from technical verification to practical teaching empowerment.

Author Contributions

Conceptualization: T.C. and Y.X.; methodology: T.C., Y.X., C.H. and L.C.; validation: T.C. and C.H.; formal analysis: Y.X. and L.C.; investigation: T.C. and C.H.; resources: Y.X.; data curation: T.C. and C.H.; writing—original draft preparation: T.C.; writing—review and editing: T.C. and Y.X.; visualization: T.C. and C.H.; supervision: Y.X. and L.C.; project administration: T.C. and Y.X.; funding acquisition: T.C. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62377007), the Scientific and Technological Research Program of Chongqing Municipal Education Commission (Nos. KJZD-M202400606, KJZD-M202300603, KJQN202400634), and the Joint Innovation and Development Fund of the Chongqing Natural Science Foundation (No. CSTB2024NSCQ-LZX0133).

Data Availability Statement

The illustrative sample data supporting this study have been available in the repository at https://github.com/cai-KU/TBU since 8 February 2025 (accessed on 15 January 2026). Researchers with legitimate research needs may contact the corresponding author to submit an application for access to the relevant data, subject to review and approval.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TTB: Teacher Teaching Behavior
CLIP: Contrastive Language-Image Pretraining
CCA: Cascaded Collaborative Attention
DQD: Dynamic Query-Driven
MCA: Multi-level Cross-Attention
STOP: Spatio-Temporal Prompts
BCE: Binary Cross-Entropy
TS: TimeSformer

References

  1. Huang, C.; Zhu, J.; Ji, Y.; Shi, W.; Yang, M.; Guo, H.; Ling, J.; De Meo, P.; Li, Z.; Chen, Z. A Multi-Modal Dataset for Teacher Behavior Analysis in Offline Classrooms. Sci. Data 2025, 12, 1115. [Google Scholar] [CrossRef] [PubMed]
  2. Lazarides, R.; Frenkel, J.; Petković, U.; Göllner, R.; Hellwich, O. ‘No words’—Machine-learning classified nonverbal immediacy and its role in connecting teacher self-efficacy with perceived teaching and student interest. Br. J. Educ. Psychol. 2025, 95, S15–S31. [Google Scholar] [CrossRef] [PubMed]
  3. Tammets, K.; Khulbe, M.; Sillat, L.H.; Ley, T. A digital learning ecosystem to scaffold teachers’ learning. IEEE Trans. Learn. Technol. 2022, 15, 620–633. [Google Scholar] [CrossRef]
  4. Zheng, L.; Li, J.; Zhu, Z.; Ji, W. LightNet: A lightweight head pose estimation model for online education and its application to engagement assessment. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 166. [Google Scholar] [CrossRef]
  5. Xiong, Y.; He, C.; Chen, L.; Cai, T. Spatio-temporal graph interaction networks for teacher behavior description in classroom scene. Eng. Appl. Artif. Intell. 2025, 159, 111668. [Google Scholar] [CrossRef]
  6. Pang, S.; Zhang, A.; Lai, S.; Zuo, Z. Automatic recognition of teachers’ nonverbal behavior based on dilated convolution. In Proceedings of the 2022 IEEE 5th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 23–25 September 2022; pp. 162–167. [Google Scholar]
  7. Pang, S.; Lai, S.; Zhang, A.; Yang, Y.; Sun, D. Graph convolutional network for automatic detection of teachers’ nonverbal behavior. Comput. Educ. Artif. Intell. 2023, 5, 100174. [Google Scholar] [CrossRef]
  8. Zhao, G.; Zhu, W.; Hu, B.; Chu, J.; He, H.; Xia, Q. A simple teacher behavior recognition method for massive teaching videos based on teacher set. Appl. Intell. 2021, 51, 8828–8849. [Google Scholar] [CrossRef]
  9. Yuvaraj, R.; Amalin, A.P.; Murugappan, M. An automated recognition of teacher and student activities in the classroom environment: A deep learning framework. IEEE Access 2024, 12, 192159–192171. [Google Scholar] [CrossRef]
  10. Peng, Y.; Lu, S.; Qiu, Z.; Wang, J. Teaching behaviors recognition by combining deep learning-based human body detection and pose estimation. In Proceedings of the 2024 International Symposium on Artificial Intelligence for Education, Xi’an, China, 6–8 September 2024; pp. 594–600. [Google Scholar]
  11. Wu, D.; Chen, J.; Deng, W.; Wei, Y.; Luo, H.; Wei, Y. The recognition of teacher behavior based on multimodal information fusion. Math. Probl. Eng. 2020, 2020, 8269683. [Google Scholar] [CrossRef]
  12. Nguyen, H.; Tran, N.; Nguyen, M.; Nguyen, H.D. Empowering classroom behavior recognition through hybrid spatial-temporal feature fusion. Appl. Intell. 2025, 55, 863. [Google Scholar] [CrossRef]
  13. Wu, D.; Wang, J.; Zou, W.; Zou, S.; Zhou, J.; Gan, J. Classroom teacher action recognition based on spatio-temporal dual-branch feature fusion. Comput. Vis. Image Underst. 2024, 247, 104068. [Google Scholar] [CrossRef]
  14. Hou, P.; Yang, M.; Zhang, T.; Na, T. Analysis of English classroom teaching behavior and strategies under adaptive deep learning under cognitive psychology. Curr. Psychol. 2024, 43, 35974–35988. [Google Scholar] [CrossRef]
  15. Guerrero-Sosa, J.D.T.; Romero, F.P.; Menéndez-Domínguez, V.H.; Serrano-Guerrero, J.; Montoro-Montarroso, A.; Olivas, J.A. A Comprehensive Review of Multimodal Analysis in Education. Appl. Sci. 2025, 15, 5896. [Google Scholar] [CrossRef]
  16. Lee, G.; Shi, L.; Latif, E.; Gao, Y.; Bewersdorff, A.; Nyaaba, M.; Guo, S.; Liu, Z.; Mai, G.; Liu, T.; et al. Multimodality of AI for Education: Toward Artificial General Intelligence. IEEE Trans. Learn. Technol. 2025, 18, 666–683. [Google Scholar] [CrossRef]
  17. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  18. Xu, T.; Deng, W.; Zhang, S.; Wei, Y.; Liu, Q. Research on recognition and analysis of teacher–student behavior based on a blended synchronous classroom. Appl. Sci. 2023, 13, 3432. [Google Scholar] [CrossRef]
  19. Zheng, Q.; Chen, Z.; Wang, M.; Shi, Y.; Chen, S.; Liu, Z. Automated multimode teaching behavior analysis: A pipeline-based event segmentation and description. IEEE Trans. Learn. Technol. 2024, 17, 1677–1693. [Google Scholar] [CrossRef]
  20. Xu, F.; Wu, L.; Thai, K.P.; Hsu, C.; Wang, W.; Tong, R. MUTLA: A large-scale dataset for multimodal teaching and learning analytics. arXiv 2019, arXiv:1910.06078. [Google Scholar]
  21. Sümer, Ö.; Goldberg, P.; D’Mello, S.; Gerjets, P.; Trautwein, U.; Kasneci, E. Multimodal engagement analysis from facial videos in the classroom. IEEE Trans. Affect. Comput. 2021, 14, 1012–1027. [Google Scholar] [CrossRef]
  22. Tang, W.; Wang, C.; Zhang, Y. Evaluation Method of Teaching Styles Based on Multi-modal Fusion. In Proceedings of the 7th International Conference on Communication and Information Processing, Beijing, China, 16–18 December 2021; pp. 9–15. [Google Scholar]
  23. Ma, X.; Zhou, J.; Wu, D.; Luo, S. A teacher behavior recognition model based on multi-stream graph convolution network. In Proceedings of the 2023 5th International Conference on Computer Science and Technologies in Education (CSTE), Xi’an, China, 21–23 April 2023; pp. 320–325. [Google Scholar]
  24. Chango, W.; Lara, J.A.; Cerezo, R.; Romero, C. A review on data fusion in multimodal learning analytics and educational data mining. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1458. [Google Scholar] [CrossRef]
  25. Liang, J.; Pan, W.; Zhang, Z.; Zhou, M. Multi-Modal Detection and Enhancement of Teachers’ and Students’ Behaviors Based on YOLO Object Detection and Voiceprint Technologies. In Proceedings of the 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Ningbo, China, 23–25 May 2025; pp. 100–106. [Google Scholar]
  26. Ni, B.; Peng, H.; Chen, M.; Zhang, S.; Meng, G.; Fu, J.; Xiang, S.; Ling, H. Expanding language-image pretrained models for general video recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–18. [Google Scholar]
  27. Wu, W.; Wang, X.; Luo, H.; Wang, J.; Yang, Y.; Ouyang, W. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6620–6630. [Google Scholar]
  28. Wang, H.; Liu, F.; Jiao, L.; Wang, J.; Hao, Z.; Li, S.; Li, L.; Chen, P.; Liu, X. ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024. [Google Scholar]
  29. Wasim, S.T.; Naseer, M.; Khan, S.; Khan, F.S.; Shah, M. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23034–23044. [Google Scholar]
  30. Wang, Q.; Du, J.; Yan, K.; Ding, S. Seeing in flowing: Adapting clip for action recognition with motion prompts learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5339–5347. [Google Scholar]
  31. Zhang, B.; Zhang, Y.; Zhang, J.; Sun, Q.; Wang, R.; Zhang, Q. Visual-guided hierarchical iterative fusion for multi-modal video action recognition. Pattern Recognit. Lett. 2024, 186, 213–220. [Google Scholar] [CrossRef]
  32. Liu, Z.; Xu, K.; Su, B.; Zou, X.; Peng, Y.; Zhou, J. STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 13776–13786. [Google Scholar]
  33. Khattak, M.U.; Naeem, M.F.; Naseer, M.; Van Gool, L.; Tombari, F. Learning to Prompt with Text Only Supervision for Vision-Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024. [Google Scholar]
  34. Cai, T.; Xiong, Y.; He, C.; Wu, C.; Cai, L. Classroom teacher behavior analysis: The TBU dataset and performance evaluation. Comput. Vis. Image Underst. 2025, 257, 104376. [Google Scholar] [CrossRef]
  35. Soomro, K.; Zamir, A.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
  36. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar]
  37. Tran, D.; Bourdev, L.D.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  38. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  39. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar]
  40. Zhang, H.; Liu, D.; Xiong, Z. Two-Stream Action Recognition-Oriented Video Super-Resolution. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8798–8807. [Google Scholar]
  41. Ranasinghe, K.; Naseer, M.; Khan, S.H.; Khan, F.S.; Ryoo, M.S. Self-supervised Video Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2864–2874. [Google Scholar]
  42. Zong, M.; Wang, R.; Chen, X.; Chen, Z.; Gong, Y. Motion saliency based multi-stream multiplier ResNets for action recognition. Image Vis. Comput. 2021, 107, 104108. [Google Scholar] [CrossRef]
Figure 1. The overall framework of the proposed TBRNet model. TBRNet takes video sequences and textual descriptions as input. Visual features, text features, and temporal information are extracted through dual CLIP encoders and a TimeSformer. These features are aligned using a semantic alignment prototype and the Cascaded Collaborative Attention (CCA) module. The aligned prototype features then guide the temporal information to generate a dynamic query vector for the final classification.
Figure 2. The design of CCA. It consists of multiple layers of MCA units. The first MCA takes H_t^(l−1) and H_c^(l−1) from the previous layer as input. The second MCA takes H̃_t^(l) (the output of the first MCA in the current layer) together with H_c^(l−1) from the previous layer as input.
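As a rough illustration of the cascade described in the Figure 2 caption, the sketch below implements one CCA layer as two chained cross-attention (MCA) steps. This is a minimal single-head NumPy sketch under stated assumptions: the single-head form, the fixed semantic stream across layers, and all tensor shapes are illustrative simplifications, not the authors' implementation (the paper's MCA is presumably multi-head with learned projections and normalization).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mca(query, context):
    """Minimal single-head cross-attention: rows of `query` attend to `context`.
    (Assumption: the paper's MCA adds multi-head projections on top of this.)"""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (n_q, n_c) attention logits
    return softmax(scores) @ context          # (n_q, d) attended features

def cca_layer(H_t, H_c):
    """One CCA layer, following the Figure 2 caption: the first MCA fuses the
    temporal features H_t^(l-1) with the semantic features H_c^(l-1); the
    second MCA refines that output against H_c^(l-1) again."""
    H_t_tilde = mca(H_t, H_c)        # first MCA of layer l
    return mca(H_t_tilde, H_c)       # second MCA of layer l

def cca(H_t, H_c, num_layers=2):
    # Stack layers; keeping H_c fixed across layers is an assumption here.
    for _ in range(num_layers):
        H_t = cca_layer(H_t, H_c)
    return H_t
```

For example, with 8 temporal tokens and 13 semantic tokens of width 16, `cca` returns refined temporal features with the input shape (8, 16) preserved.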
Figure 3. Examples from the TBU dataset.
Figure 4. Confusion matrix on the TBU dataset.
Table 1. Comparison of the results of different models on the TBU dataset.

| Model | Acc/Top-1 (%) | Acc/Mean (%) | Acc/F1-Score (%) |
|---|---|---|---|
| C3D [37] | 81.3 | 66.8 | 71.1 |
| I3D [36] | 81.7 | 70.9 | 68.3 |
| SlowFast [39] | 82.2 | 66.9 | 68.1 |
| TSN [38] | 82.7 | 70.9 | 74.1 |
| TimeSformer [17] | 83.4 | 71.9 | 74.3 |
| BIKE [27] | 83.3 | 78.9 | 75.5 |
| STOP [32] | 84.6 | 80.5 | 77.2 |
| TBRNet | **86.4** | **82.1** | **81.8** |
Note: Bold values in the table indicate the best performance for the corresponding metric.
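For clarity on the metrics reported in Table 1, the sketch below shows one common way to compute Top-1 accuracy, mean (per-class) accuracy, and macro F1-score from label vectors. The exact averaging conventions used by the authors are an assumption here; this is an illustrative NumPy sketch, not their evaluation code.

```python
import numpy as np

def top1_accuracy(y_true, y_pred):
    # Fraction of samples whose top-ranked prediction matches the label.
    return float(np.mean(y_true == y_pred))

def mean_class_accuracy(y_true, y_pred, num_classes):
    # Accuracy computed per class, then averaged so rare classes count equally.
    accs = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            accs.append(np.mean(y_pred[mask] == c))
    return float(np.mean(accs))

def macro_f1(y_true, y_pred, num_classes):
    # Per-class F1 from true/false positives and false negatives, then averaged.
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```

On imbalanced datasets such as TBU, mean class accuracy and macro F1 penalize models that ignore rare behavior categories, which plain Top-1 accuracy does not.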
Table 2. Comparison of action recognition models on the UCF101 and HMDB51 datasets.

| Model | Type | Backbone | UCF101 Acc/Top-1 (%) | HMDB51 Acc/Top-1 (%) |
|---|---|---|---|---|
| C3D+IDT [37] | 3D | ResNet-50 | 90.4 | - |
| I3D [36] | 3D | Inception-V1 | 95.1 | 74.3 |
| TSN [38] | 2D | ResNet-50 | 94.2 | 69.4 |
| SoSR+ToSR [40] | Two-stream | - | 92.1 | 68.3 |
| MSM-ResNets [42] | Two-stream | ResNet-50 | 93.5 | 69.2 |
| SVT [41] | Attention | ViT-B/14 | 93.7 | 67.2 |
| MCB [30] | Attention | ViT-B/14 | **96.3** | 72.9 |
| STOP [32] | Attention | ViT-B/32 | 95.3 | 72.0 |
| TBRNet | Attention | ViT-B/14 | 96.2 | **75.7** |
Note: Bold values in the table indicate the best performance for the corresponding metric.
Table 3. Recognition Accuracy of Different Teaching Behaviors.

| Serial Number | Behavior Category | Recognition Accuracy |
|---|---|---|
| 0 | board writing | 0.925 |
| 1 | erasing the blackboard | 0.909 |
| 2 | operate multimedia | 0.658 |
| 3 | multimedia teaching | 0.845 |
| 4 | teacher bows | 0.800 |
| 5 | displaying teaching aids | 0.781 |
| 6 | lecture on the podium | 0.876 |
| 7 | walking around on the podium | 0.661 |
| 8 | interacting with students on the podium | 0.849 |
| 9 | lecture underneath the podium | 0.890 |
| 10 | walking around underneath the podium | 0.849 |
| 11 | interact with students outside the podium | 0.882 |
| 12 | point to the blackboard | 0.736 |
Table 4. Performance Comparison of Different Module Combinations.

| TS | CCA | DQD | Acc/Top-1 (%) |
|---|---|---|---|
| ✓ | × | × | 83.42 |
| ✓ | ✓ | × | 85.95 |
| ✓ | × | ✓ | 84.76 |
| ✓ | ✓ | ✓ | **86.40** |
Note: ✓ indicates the module is added; × indicates the module is not added.
Table 5. Performance Comparison of Different Input Combinations.

| ST-R | GV-S | TC-S | Acc/Top-1 (%) |
|---|---|---|---|
| ✓ | × | × | 83.42 |
| ✓ | ✓ | × | 84.69 |
| ✓ | × | ✓ | 85.78 |
| ✓ | ✓ | ✓ | **86.40** |
Note: ✓ indicates the input is added; × indicates the input is not added.