A Dynamic Precision Evaluation System for Physical Education Classroom Teaching Behaviors Based on the CogVLM2-Video Model

Liu, Chao; Yang, Fan; Ge, Chengyu; Shao, Zhiyu

doi:10.3390/app15147712

¹ College of Physical Education, Yangzhou University, Yangzhou 225127, China

² College of Electrical, Energy and Power Engineering, Yangzhou University, Yangzhou 225127, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 7712; https://doi.org/10.3390/app15147712

Submission received: 11 June 2025 / Revised: 4 July 2025 / Accepted: 7 July 2025 / Published: 9 July 2025

(This article belongs to the Special Issue Intelligent Data Processing and Management: Technologies and Applications)

Abstract

Analyses of teaching behaviors in physical education (PE) classrooms are critical for evaluating teaching quality. Traditional evaluation methods primarily rely on manual analysis, which suffers from complex coding procedures, low efficiency, and suboptimal accuracy, hindering long-term sustainability in teaching quality improvement. Artificial intelligence (AI) technology offers a novel approach by enabling real-time data collection, automated annotation, and in-depth analysis of teaching behaviors, thereby supporting sustainable PE teaching optimization. Leveraging the CogVLM2-Video model, the research presents a system for real-time data collection, automated annotation, and in-depth analysis of teaching behaviors. It consists of four key modules. The perception layer handles data acquisition and input, providing foundational data for analysis. The platform layer manages data processing and storage, ensuring integrity and security for long-term evaluation. The model layer focuses on behavior recognition and analysis, employing advanced algorithms for precise interpretation of teaching behaviors. The application layer delivers real-time feedback and adaptive recommendations, promoting sustained teaching improvement. The system architecture was initially validated using 50 basketball lesson videos. Then, the recognition model was trained on a Kinetics-400 subset, achieving 92% accuracy and 95% consistency with manual annotations. These results demonstrate the system’s practical value and long-term applicability, offering an efficient, precise solution for PE classroom teaching behavior assessment.

Keywords: AI-driven data processing; teaching behavior analysis; CogVLM2-Video; automated assessment; long-term pedagogical development

1. Introduction

Evaluations of teaching behaviors in physical education (PE) classrooms are critical components of overall classroom teaching evaluations, which directly impact teaching effectiveness and students’ learning experience [1]. However, current evaluations of PE teaching behaviors rely primarily on manual observation and analysis. This approach entails multiple challenges when assessing complex classroom scenarios and attempting to capture specific actions and interactions. Such challenges include inefficiency, subjectivity, and a lack of real-time capabilities [2]. In recent years, rapid advancements in artificial intelligence (AI) technologies, such as computer vision and deep learning, have made automatic recognition and analysis of complex behaviors possible, offering new technological pathways for addressing these issues [3]. Nevertheless, integrating these technologies into PE classroom contexts poses distinct challenges due to the high motion intensity, multi-subject interaction, and spatial variability of physical education settings.

Against this backdrop, this study aimed to design and implement an AI-driven system that enables comprehensive, real-time, and accurate evaluation of teaching behaviors in secondary school PE classrooms. The system targets observable behaviors aligned with China’s curriculum standards, assessing teacher, student, and interactive actions through a framework incorporating entropy, redundancy, ratio, and temporal analysis [4]. Specifically, the study seeks to answer the following research question: Can an AI-based system, leveraging the CogVLM2-Video model, accurately recognize and automatically evaluate teaching behaviors in PE classrooms? Answering this question is expected to provide robust scientific support and practical guidance for improving the quality of physical education teaching.

2. Literature Review

2.1. From Pure Manual to Semi-Automated Traditional Evaluations of Teaching Behaviors in PE Classrooms

The evaluation of teaching behaviors in physical education classrooms is evolving from subjective observation to structured, data-driven approaches, progressing through two key stages: purely manual evaluation and a partially automated evaluation model, which will be discussed below.

2.1.1. Purely Manual Evaluation Stage

During this initial stage, evaluators drew on their accumulated teaching experience and professional knowledge to design appropriate evaluation tools for manually annotating and analyzing teaching behaviors observed in PE classrooms. Cheffers (1983) adapted Flanders’ Interaction Analysis System (FIAS) into the Cheffers’ Adaptation of Flanders’ Interaction Analysis System (CAFIAS) for analyzing three types of behaviors reflecting the classroom learning atmosphere, teaching style, and teaching effectiveness: teachers’ speech, students’ speech, and silence or chaos [5]. Stewart (1989) designed the Observational Recording Record of Physical Educator’s Teaching Behavior (ORRPETB) tool for analyzing three aspects of teacher behavior: classroom time allocation, teacher–student interactions, and teaching behaviors [6]. Quested et al. (2018) designed the Need-Relevant Instructor Behaviors Scale (NIBS) to assess the frequency and intensity of teachers’ behaviors in meeting students’ emotional, competence, and autonomy needs [7]. However, because these methods relied solely on manual evaluation, they were subject to limitations such as subjectivity, small sample sizes, and time-consuming coding processes.

2.1.2. Semi-Automated Evaluation Stage

The rapid advancement of information technology has led to the development of software tools for annotating and analyzing teaching behaviors in PE classrooms for evaluation purposes. Commonly used tools include NVivo, LessonNote, and Mixed Media Annotation (MMA) Video 2.0. NVivo is used to summarize the characteristics and influencing factors of PE classroom teaching by annotating and analyzing three dimensions of classroom teaching videos, namely teacher behaviors, student behaviors, and interaction behaviors [8]. However, in practice, this semi-automated approach is limited by major efficiency issues. Annotating a 45 min video typically requires several hours of manual work. Because of the delay caused by manual annotation, real-time analysis and rapid feedback cannot be achieved, which reduces its value for improving teaching promptly.

While traditional tools such as CAFIAS, ORRPETB, and NIBS have laid the groundwork for behavior evaluation in PE classrooms, they are constrained by subjectivity, time-intensive coding, and a lack of real-time feedback. In highly dynamic physical education (PE) settings, such static frameworks struggle to capture complex, multimodal interactions between teachers and students. Semi-automated tools like NVivo and MMA Video 2.0 improve annotation efficiency but still rely heavily on manual input and post hoc analysis, limiting their scalability and timely instructional relevance. In contrast, AI-based models—such as CogVLM2-Video—integrate deep learning with motion tracking and real-time analysis, enabling high-resolution, continuous data processing with greater objectivity. These systems offer the potential for scalable, data-rich evaluation while also supporting long-term behavior pattern analysis.

2.2. From Convolutional Neural Network-Based Video Action Recognition to Intelligent Evaluation of Teaching Behaviors in PE Classrooms with CogVLM2-Video Models

To address the limitations of traditional manual or rule-based evaluations of teaching behaviors, particularly in dynamic physical education (PE) contexts, researchers have explored progressively advanced AI-driven behavior recognition models. These developments can be broadly categorized into three stages, described below.

2.2.1. Exploratory Stage: Convolutional Neural Network and Temporal Segment Network (2014–2016)

In 2014, Karpathy et al. introduced a video classification method based on a convolutional neural network (CNN), marking the first application of deep learning tools for video data analysis [9]. This method effectively captures spatial features in videos, but their temporal dimensions are less readily captured. As a result, it performs poorly in identifying actions recorded in long-duration videos. To address this limitation, in 2016, Wang et al. proposed integrating CNN within temporal segment networks (TSN) to extract spatiotemporal features, allowing video segmentation and sampling [10]. Although TSN improved the accuracy of action recognition in long videos, it remained inadequate for capturing short-term variations and subtle behaviors in complex PE classroom environments.

2.2.2. Temporal Modeling Enhancements: Convolutional 3D, Inflated 3D, and Temporal Pyramid Network Models (2016–2024)

The temporal pyramid network (TPN) was introduced to overcome the limitations of TSN. Its pyramid-like structure enabled flexible extraction of features at different temporal scales, thus effectively capturing short-term dynamics, while retaining long-term information [11]. Building on this model, researchers developed convolutional 3D (C3D) and inflated 3D ConvNet (I3D) models, which captured spatial and temporal information simultaneously. These models proved effective for analyzing students’ attention and actions in complex teaching settings [12]. However, notwithstanding their theoretical advantages, they faced challenges such as high computational cost and reliance on large-scale video data, which limited their widespread application in evaluations of teaching behavior.

2.2.3. Emergence and Application Potential of the CogVLM2-Video Model (2024–2025)

CogVLM2-Video, introduced in 2024, is a next-generation multimodal video understanding model. Its application in education—especially physical education—is still uncommon. To justify its adoption in this study, we first review recent deep learning models applied to classroom and PE behavior recognition.

In recent years, studies have increasingly applied object detection and pose estimation models to classroom behavior analysis. For example, Jia Q et al. combined YOLOv5 with contextual attention and OpenPose for detecting student behaviors, achieving an mAP of 82.1% [13]. Similarly, Han L et al. extended YOLOv8 with two-dimensional positional encoding and multi-head attention to enable accurate real-time detection in complex, multi-subject environments [14]. In parallel, researchers explored advanced temporal modeling techniques. For instance, Wang et al. used 3D CNNs to capture motion dynamics from visual input alone [15]. Building on this, multimodal strategies also emerged. Zheng et al. (2024) introduced a behavior segmentation pipeline integrating visual, audio, and textual features for structured interpretation [16]. Abedi et al. (2021) applied ResNet and Temporal Convolutional Networks (TCN) to estimate student engagement patterns over time [17].

Despite these advances, many models are limited by rigid architectures, low-resolution inputs, and a lack of explainable temporal alignment, factors that hinder their effectiveness in real-world PE classrooms. Addressing these limitations, CogVLM2-Video, introduced in 2024, represents a next-generation multimodal vision-language model. As an extension of CogVLM2, it accommodates high-resolution video inputs (up to 1344 × 1344), aligns visual content across multiple frames using timestamps, and supports natural language querying of event context. It has achieved state-of-the-art results on benchmarks such as MVBench and VideoChatGPT-Bench, outperforming models like TimeSformer and VideoMAE [18].

While CogVLM2-Video’s architecture enables visual–language alignment, this study focuses specifically on its visual encoding and temporal recognition components, given the nature of PE classroom videos. Nevertheless, its capacity for fine-grained and interpretable behavior recognition makes it well suited to educational use cases. Related models have already demonstrated promise in adjacent instructional contexts—for example, Peng et al. (2025) applied a similar system to segment and label complex procedural steps in medical training videos, highlighting its effectiveness for structured, temporally grounded analysis [19].

This literature review traces the evolution of teaching behavior evaluation methods in PE classrooms—from manual observation to semi-automated tools and intelligent deep learning–based systems. While traditional methods are foundational, they are insufficient for large-scale, dynamic, and multimodal behavior analysis. The emergence of deep learning models, particularly the CogVLM2-Video framework, provides new technical pathways for efficient, accurate, and scalable evaluation. Building upon earlier systems, this study adopts the CogVLM2-Video model to design an AI-powered evaluation system for PE classrooms. Leveraging its spatiotemporal feature extraction capabilities, the system enables precise annotation and nuanced analysis of complex teaching behaviors. This establishes a robust technological foundation for developing personalized, real-time, and multidimensional feedback mechanisms that support instructional improvement.

3. Materials and Methods

3.1. System Design Methodology

The CogVLM2-Video model-based system developed to evaluate teaching behaviors in PE classrooms comprises four core components: a perception layer, a platform layer, a model layer, and an application layer (see Figure 1). At the lowest level, the perception layer provides the interface for acquiring external information. Its primary function is to collect data on teacher and student behaviors and their interaction behaviors during classroom sessions.

The platform layer preprocesses raw data, generating a database for storing user information and statistical results and facilitating the development of front-end and back-end systems using Vue 2.6.10 (a JavaScript framework for building user interfaces) and Spring Boot 2.6.13 (a Java-based framework for building backend applications). This layer operates on a cloud server, ensuring seamless connectivity using a network topology.

The model layer is the core of the system, leveraging the CogVLM2-Video toolkit for in-depth analysis of video data. This layer extracts static features through an RGB network and dynamic optical flow features via a flow network. Integrating these two types of features, the model layer can effectively recognize complex teaching behaviors and convert them into structured data suitable for processing using the evaluation model. This process ensures precise translation of raw data on teaching behaviors into actionable insights. Once the data have been successfully processed by the evaluation model, intelligent analysis algorithms generate evaluation results, which are stored in the cloud server database. The results are then transmitted in real time to the front-end interface, providing an intuitive and visually rich presentation.

The application layer comprises four core modules. The theory display module showcases key theories on classroom teaching behaviors. The statistical analysis module generates graphical representations and reports based on the evaluation results. The classroom assessment module generates quantitative evaluations of teaching behaviors, enabling the provision of data-driven feedback. Finally, the user management module supports multi-user access and permission management, ensuring secure and efficient data interactions.

By integrating these advanced technologies, the system architecture provides a comprehensive framework for collecting, processing, and analyzing teaching behavior data, offering users clear and practical insights in an accessible interface.

3.2. Technical Implementation Principles

The system’s architecture was designed according to the principle of separating concerns [20], covering three developmental aspects: data collection, front-end display, and back-end processing (see Figure 2). Developers working on the data collection module managed external interfaces, while front-end developers focused on user experience and interface design, and back-end developers handled data processing, algorithm optimization, and secure storage.

The first step of the technical implementation pathway is to record classroom teaching using high-resolution cameras, with the video streams then transmitted to the processing module on the cloud server via the Real-Time Streaming Protocol (RTSP). The CogVLM2-Video neural network model automatically recognizes video content and classifies actions, and the evaluation model processes the results, which are then displayed on the front-end interface. Multiple cameras are strategically placed in key classroom locations to achieve comprehensive coverage of teacher and student behaviors and their interaction behaviors.

3.3. Data Analysis

3.3.1. System Architecture Verification and Analysis

To validate the system’s architecture, simulation experiments were conducted using 50 classroom videos from secondary PE lessons. These videos were pre-processed and analyzed by the CogVLM2-Video model, which achieved 90% accuracy in teaching behavior recognition and 95% consistency with manual annotations (see Table 1), confirming the system’s operational effectiveness.

To ensure the reliability of the manual annotations used as reference, a subset of 20 videos was independently labeled by two trained coders. Inter-rater agreement reached a Cohen’s Kappa of 0.87, indicating high consistency. These verified annotations served as the ground truth for evaluating the model’s outputs across the full dataset.

The 50 video samples covered a range of instructional phases—including teacher demonstration, student practice, interaction, and transitions—enhancing the representativeness of the evaluation. Although no formal baseline model was applied, the expert-coded annotations provided a valid benchmark under authentic classroom conditions.

3.3.2. Application Validation and Analysis of the System

To validate the sustainable application of this system, a representative case analysis was conducted using video-recorded “pass-and-cut coordination” drills from a high school basketball lesson as the exemplary case. The instructor, holding a Master of Education degree, conducted the lesson with 32 Grade 10 male students (Level V according to China’s Physical Education Curriculum Standards) [4]. The instructional content progressed through three sequenced components: (a) fundamental drills for cutting routes and passing timing in give-and-go plays, (b) tactical application through constrained practice progressing to modified games, and (c) sport-specific conditioning exercises.

Aligned with the pedagogical principles of China’s Physical Education and Health Curriculum Standards, this lesson employed instructional strategies that balanced efficiency with long-term goals—a strength explicitly recognized by instructional evaluation experts. The CogVLM2-Video system performed real-time analysis of teaching behaviors with optimized computational efficiency, illustrating how sustainable AI technologies can support scalable and continuous assessment in physical education.

3.4. Ethical Considerations

This study used publicly available, de-identified video and textual materials, involving no direct intervention or identifiable personal information. As such, individual informed consent was not required under local ethical guidelines. All classroom teachers and school administrators involved in system implementation were informed of the research purpose and data use in advance.

The study focused on general patterns of teaching behavior within the context of China’s Physical Education and Health Curriculum Model, without targeting any specific individuals or groups. Therefore, the risk of causing adverse effects on participants was minimal. Nevertheless, if unintended consequences do arise from the interpretation or dissemination of the results, the research team will proactively engage with relevant parties to clarify the context and meaning of the findings and seek appropriate solutions to minimize potential impact.

This research complies with institutional ethical standards for educational research. Formal ethics approval was not required due to the use of non-interventional, anonymized public data.

4. Results

4.1. Implementation of the CogVLM2-Video Model-Based System for Evaluating Teaching Behaviors in PE Classrooms

A framework based on the CogVLM2-Video toolkit has been developed for implementing the evaluation system (Figure 3). It covers four key stages: (1) constructing an evaluation index, (2) collecting intelligent data, (3) establishing the model, and (4) developing the platform. The first stage establishes the foundation of the evaluation process and defines the measurement dimensions. The second stage involves efficient collection of accurate teaching behavior data, which supports subsequent analysis. The third stage enables automated annotation and precise analyses of behaviors to support improved teaching quality. The final stage, platform development, integrates the three preceding stages, providing a stable and comprehensive operational environment for smooth system performance.

4.1.1. Construction of Diversified Evaluation Indicators for Teaching Behaviors as the Foundation for AI Integration and Precise Annotation

The construction of a diversified indicator system for evaluating teaching behaviors in PE classrooms forms the foundation for integrating AI technologies and achieving precise machine annotation. This system translates complex, hard-to-quantify behavioral characteristics into actionable and measurable indicators.

The indicator framework was developed based on prior studies and aligned with China’s Physical Education and Health Curriculum Model. Its validity and applicability were confirmed through a two-round Delphi consultation with 22 experts in physical education, pedagogy, and psychology. In the first round, experts rated each indicator’s importance and clarity on a 5-point Likert scale and provided qualitative feedback. Indicators with low average scores or high variability were revised or removed. In the second round, the revised indicators were re-evaluated. All indicators showed strong consensus, with coefficients of variation below 0.25 and average scores above 3.5. The Kendall’s W coefficients for the first-, second-, and third-level indicators were 0.146, 0.431, and 0.293, respectively, all statistically significant at p < 0.01.
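The screening logic described above can be illustrated with a short sketch. It assumes that the reported consensus figures (mean above 3.5, coefficient of variation below 0.25) act as retention thresholds and that the coefficient of variation uses the sample standard deviation; neither detail is stated in the text, and the function name and example ratings are hypothetical.

```python
from statistics import mean, stdev


def screen_indicators(ratings, min_mean=3.5, max_cv=0.25):
    """Keep indicators whose expert ratings show a high mean and a low spread.

    ratings maps an indicator name to its list of 5-point Likert scores.
    The thresholds are assumptions taken from the consensus figures in the text.
    """
    kept = {}
    for name, scores in ratings.items():
        avg = mean(scores)
        cv = stdev(scores) / avg  # coefficient of variation (sample SD / mean)
        if avg > min_mean and cv < max_cv:
            kept[name] = (round(avg, 3), round(cv, 3))
    return kept


# Hypothetical ratings from five experts for two candidate indicators.
example_ratings = {
    "skill practice": [5, 4, 5, 4, 5],
    "hypothetical weak item": [1, 5, 2, 5, 3],
}
retained = screen_indicators(example_ratings)
print(retained)
```

In this example the second indicator is dropped because its mean rating (3.2) falls below the threshold, which mirrors the "revised or removed" step of the first Delphi round.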

Based on the second-round ratings, weights were calculated for the three first-level indicators. Interaction behavior received the highest weight (0.659), followed by student behavior (0.185) and teacher behavior (0.156), reflecting the curriculum’s emphasis on student-centered and interaction-driven instruction.
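As an illustration of how such weights might be used, the sketch below combines per-dimension scores into a single weighted total. The text does not describe how the system aggregates dimension scores, so the linear combination, the 0 to 100 score scale, and the example values are all assumptions; only the three weights come from the study.

```python
# First-level indicator weights reported in the study (they sum to 1.000).
WEIGHTS = {"interaction": 0.659, "student": 0.185, "teacher": 0.156}


def composite_score(dimension_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(weights[name] * dimension_scores[name] for name in weights)


# Hypothetical scores on an assumed 0-100 scale.
lesson_scores = {"interaction": 80, "student": 70, "teacher": 90}
print(round(composite_score(lesson_scores), 2))
```

Because interaction behavior carries roughly two thirds of the total weight, a change in the interaction score moves the composite far more than an equal change in either of the other two dimensions.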

The final evaluation system included three core dimensions—teacher behaviors, student behaviors, and interaction behaviors—along with 12 secondary indicators. These indicators were explanation and demonstration, guidance and evaluation, technology use, transitions, warm-up exercises, skill practice, fitness training, relaxation exercises, questioning and responses, cooperative learning, teacher–student competitions, and peer discussion and evaluation (see Figure 4) [21].

4.1.2. Intelligent Data Collection: High-Precision Multi-Camera Setup for Comprehensive Non-Intrusive Data Collection and Pre-Processing

The module on intelligent data collection uses a Sony Alpha A7 IV high-definition camera (Sony Corporation, Wuxi, China), which features a 1080p (1920 × 1080) resolution and a frame rate of 30 fps, integrated with five-axis stabilization technology. Optimized for dynamic classroom settings, the camera’s high resolution ensures accurate motion capture, enhancing action classification accuracy, video smoothness, and optical flow quality. The built-in five-axis stabilization allows for stable data collection under both handheld and in-motion conditions (e.g., movement during physical activities or teacher–student interactions), effectively mitigating environmental interference and ensuring the acquisition of high-quality classroom interaction data. Additionally, the camera is equipped with a wireless transmission module, enabling real-time uploading of classroom data to the cloud server for further system analysis and processing.

Data pre-processing comprises the following steps. First, noise reduction with a median filtering algorithm eliminates potential noise artifacts introduced during data collection. Next, the video is segmented into 2 s time windows with a 0.5 s overlap. Data normalization is then applied to video frames by adjusting pixel values to the range [0,1] to meet the input requirements of the CogVLM2-Video model. Static features are extracted from the pre-processed data using RGB images, while dynamic features are derived from optical flow images. The Farneback method is used to generate optical flow images by calculating motion vectors between consecutive frames.
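The windowing and normalization steps can be sketched in plain Python as follows. This is a minimal illustration rather than the system's implementation: the function names are hypothetical, 8-bit pixel values are assumed, and a trailing partial window is simply dropped, which the text does not specify. Median filtering and Farneback optical flow are omitted here; in practice they would typically come from an image-processing library such as OpenCV (`cv2.medianBlur`, `cv2.calcOpticalFlowFarneback`).

```python
def window_bounds(total_frames, fps=30, window_s=2.0, overlap_s=0.5):
    """Return (start, end) frame indices of overlapping windows, end exclusive."""
    win = int(round(window_s * fps))                # 2 s at 30 fps -> 60 frames
    hop = int(round((window_s - overlap_s) * fps))  # 1.5 s stride -> 45 frames
    if win <= 0 or hop <= 0:
        raise ValueError("window must be longer than the overlap")
    bounds = []
    start = 0
    while start + win <= total_frames:  # assumption: drop a trailing partial window
        bounds.append((start, start + win))
        start += hop
    return bounds


def normalize_pixels(frame):
    """Scale 8-bit pixel values (0-255) to the range [0, 1]."""
    return [[value / 255.0 for value in row] for row in frame]


# A 5 s clip at 30 fps (150 frames) yields three overlapping windows.
print(window_bounds(150))
print(normalize_pixels([[0, 255], [51, 102]]))
```

With a 2 s window and a 0.5 s overlap, consecutive windows start 1.5 s apart, so each pair of neighbouring windows shares 15 frames at 30 fps.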

4.1.3. Platform Construction: Front-End and Back-End Technologies with MySQL Database Used to Create an Efficient, Compatible, and Stable System Environment

The platform developed for the evaluation system targeting teaching behaviors in PE classrooms was designed to ensure efficiency, compatibility, and stability by integrating advanced front-end and back-end technologies with a robust MySQL 8.0.32 database. At the front end, Vue, HTML, CSS, JavaScript, and Ajax technologies are used to deliver an intuitive, user-friendly interface. Vue serves as the primary framework, enabling component-based development to enhance code modularity, reusability, and readability. HTML and CSS are used to establish a clear structural layout and an esthetically pleasing interface design, ensuring cross-platform responsiveness on desktop, tablet, and mobile devices. JavaScript and Ajax boost the system with asynchronous data interaction capabilities, enabling real-time updates, which significantly enhance user experience by ensuring smooth and responsive system operations.

The back-end framework uses the SSM architecture (Spring 2.6.13, SpringMVC 2.6.13, and MyBatis 3.5.16), thus facilitating lightweight and modular development of web applications. Core business logic is managed via the Spring framework, which simplifies the configuration and integration of components through dependency injection. SpringMVC handles user requests, facilitating efficient front-end/back-end interaction. For instance, when users upload classroom videos via the Vue interface, HTTP POST requests are sent from the front end to the back-end server. The server processes these videos, stores them in a cloud database, analyzes their content using the CogVLM2-Video model, and returns the results in JSON format. These results are then visualized at the front-end interface, enabling users to interact with and interpret the data easily. The persistence layer framework, MyBatis, bridges the application logic and the MySQL database, mapping SQL query results to Java objects and simplifying data management operations.

The MySQL database serves as the backbone of the system’s data management infrastructure. As an open-source relational database management system, MySQL is renowned for its high performance, stability, and scalability. The system leverages MySQL’s support for SQL-based data operations, enabling flexible and customizable configurations to manage and analyze data related to PE teaching behaviors, thus facilitating the evaluation and improvement of teaching practices. This allows for tailored analysis and reporting to meet the diverse needs of physical education evaluation. Additionally, MySQL offers advanced transaction management and concurrency control mechanisms, ensuring data integrity and consistency, even under high-concurrency scenarios. Its cross-platform compatibility allows deployment across multiple operating systems, including Windows, Linux, and macOS, further enhancing its adaptability and portability.

Through seamless integration of these front-end and back-end technologies with the MySQL database, the platform ensures a reliable, efficient, and user-friendly environment. This enables it to process and analyze large volumes of teaching behavior data in real time, providing actionable insights for enhancing teaching practices in PE classrooms.

4.1.4. Model Development: Integrating the CogVLM2-Video Model with Multi-Algorithm Fusion for Automated Annotation and Comprehensive Analysis of Teaching Behaviors

➀: CogVLM2-Video Model

CogVLM2-Video is a next-generation video understanding foundation model jointly developed by Tsinghua University and Zhipu AI. As the video-specific variant within the CogVLM multimodal model family, it extends the original vision-language understanding capabilities to support multi-frame video inputs, enabling the modeling of temporal sequences and comprehension of dynamic scenes. Based on input teaching videos and guided by textual prompts, CogVLM2-Video can automatically recognize teacher behaviors, student actions, and teacher–student interactions. It performs structured annotation and classification of teaching behaviors, making it a crucial model for intelligent classroom teaching evaluation.

The core architecture of CogVLM2-Video is designed around multimodal video understanding. The workflow begins with frame-wise decomposition and timestamp injection, extracting key frames and assigning precise temporal markers to support subsequent temporal modeling. The visual encoder adopts a Vision Transformer based on EVA-CLIP for high-resolution feature extraction from frame images. A 2 × 2 convolutional layer is employed to compress the feature sequence length, improving computational efficiency while capturing fine-grained visual details in teaching scenes, such as blackboard explanations by the teacher or students’ physical actions. An adapter module, composed of convolutional layers and SwiGLU activation functions, aligns visual features with the input space of the language model, ensuring deep integration of visual semantics and textual instructions. The language decoder, powered by the LLaMA3-8B large language model, combines temporal embedding features and multimodal prompts to dynamically generate behavior classification labels and temporal localization outputs. For instance, it can identify events such as “the teacher begins warm-up activities at second 15” or “students engage in group training between seconds 20 and 25.” As illustrated in Figure 5, the entire system uses a cross-modal attention mechanism to optimize the interaction between spatiotemporal features and language representations. This enables not only precise localization of when teaching behaviors occur but also semantic interpretation of complex interaction scenarios. Ultimately, it forms a complete closed-loop system, from video parsing to behavior-level semantic output, offering strong technical support for automated classroom analysis and intelligent feedback.
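The frame decomposition and timestamp injection step can be illustrated with a small sketch that pairs each sampled frame index with its time in seconds. Uniform sampling and the parameter values are assumptions made for illustration; the text does not state how many frames the model samples or how key frames are chosen, and the function name is hypothetical.

```python
def sample_frames_with_timestamps(total_frames, fps, num_samples):
    """Pick evenly spaced frame indices and attach each frame's timestamp.

    Returns a list of (frame_index, seconds) pairs. Uniform spacing is an
    assumption, not the model's documented sampling strategy.
    """
    if total_frames <= 0 or fps <= 0 or num_samples <= 0:
        raise ValueError("total_frames, fps and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    samples = []
    for k in range(num_samples):
        index = int(k * step)
        samples.append((index, round(index / fps, 3)))
    return samples


# A 10 s clip at 30 fps, sampled at five points.
print(sample_frames_with_timestamps(300, 30, 5))
```

Timestamps of this kind are what allow a generated statement such as "warm-up begins at second 15" to be tied back to a specific position in the video.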

➁: Intelligent Algorithm Analysis Model

(1): Information Entropy Analysis

Information entropy analysis measures the level of activity in PE classrooms. A larger entropy value indicates more diversified teaching behaviors, while a smaller value suggests monotonous teaching activities. Teachers’ proficiency in controlling the pace of classroom learning and guiding students is reflected in their behavioral changes. Similarly, changes in students’ behaviors reflect their classroom participation and self-directed learning. The following formula, originally proposed by Shannon, was used to calculate the information entropy of teaching behaviors:

$$H = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i}, \tag{1}$$

where $i = 1, 2, 3, \dots, n$; $p_{i} = m_{i}/M$ represents the probability of the occurrence of the $i$-th type of teaching behavior during the PE classroom session; $M$ represents the total number of teaching behavior samples; and $m_{i}$ is the total number of occurrences of the $i$-th type of teaching behavior [22]. According to the principle of entropy, when each teaching behavior type occurs with equal probability ($p_{i} = 1/n$), the information entropy $H$ reaches its maximum value, expressed as $H_{\max} = -\log_{2} p_{i} = \log_{2} n$.
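A minimal sketch of Equation (1) follows, assuming the annotated lesson is available as a flat list of behavior labels (one label per sample). The label names and the function name `behavior_entropy` are hypothetical.

```python
import math
from collections import Counter


def behavior_entropy(labels):
    """Shannon entropy (in bits) of a sequence of behavior labels.

    p_i is estimated as m_i / M, where m_i is the count of label i
    and M is the total number of samples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one behavior sample is required")
    return -sum((m / total) * math.log2(m / total) for m in counts.values())


if __name__ == "__main__":
    # Hypothetical annotation stream: 8 samples over 4 behavior types.
    labels = ["explain"] * 4 + ["practice"] * 2 + ["question", "transition"]
    print(behavior_entropy(labels))
```

With equally frequent labels the function returns $\log_{2} n$, the maximum described above; a lesson dominated by one behavior type yields a value close to zero.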

(2): Redundancy Analysis

Redundancy analysis refers to the degree of overlap in calculation content among various evaluation indicators. It emphasizes the extent to which teachers reinforce key concepts and difficult points, and the extent to which certain behaviors or learning processes are repeated in students’ comprehension and practice, which may indicate the need for additional explanation or practice. The redundancy (r) of teaching behaviors in this PE lesson was calculated, based on Shannon’s information theory, as

$$r = 1 - \frac{H}{H_{\max}}, \tag{2}$$

where $H$ denotes the information entropy of teaching behaviors recorded during a specific PE lesson; $H_{\max}$ represents the maximum information entropy of the teaching behaviors for that lesson; and $h$ is the relative information entropy for the lesson, so that $h = H / H_{\max}$ [22].
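As a worked illustration of Equations (1) and (2), consider a hypothetical lesson with four behavior types whose relative frequencies are 1/2, 1/4, 1/8, and 1/8 (values chosen for easy arithmetic, not taken from the study):

$$H = \tfrac{1}{2}\log_{2} 2 + \tfrac{1}{4}\log_{2} 4 + 2 \cdot \tfrac{1}{8}\log_{2} 8 = 0.5 + 0.5 + 0.75 = 1.75 \ \text{bits},$$

$$H_{\max} = \log_{2} 4 = 2 \ \text{bits}, \qquad h = \frac{1.75}{2} = 0.875, \qquad r = 1 - 0.875 = 0.125.$$

A lesson that spread its samples evenly over the four types would instead give $H = H_{\max}$ and $r = 0$.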

(3): Ratio Analysis

Ratio analysis, a commonly used method in PE classroom behavior research, was applied to reveal the distribution of teaching behaviors [23]. The proportions of teacher behaviors, student behaviors, and teacher–student interaction behaviors were calculated independently, as shown in Equations (3)–(5):

$$J = \frac{J_{b}}{S_{b}} \times 100\%, \tag{3}$$

where $J$ denotes the proportion of teacher behaviors in a given PE lesson, $J_{b}$ is the total number of teacher behaviors during the lesson, and $S_{b}$ is the total number of all behaviors in the lesson.

$$X = \frac{X_{b}}{S_{b}} \times 100\%, \tag{4}$$

where $X$ represents the proportion of student behaviors during a given PE lesson, $X_{b}$ is the total number of student behaviors, and $S_{b}$ is the total number of all behaviors during the lesson.

$$D = \frac{D_{b}}{S_{b}} \times 100\%, \tag{5}$$

where $D$ represents the proportion of interaction behaviors during a given PE lesson, $D_{b}$ is the total number of interaction behaviors, and $S_{b}$ is the total number of all behaviors during the lesson.
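Equations (3)–(5) can be sketched as a single helper. The sketch assumes that every behavior belongs to exactly one of the three groups, so that $S_{b} = J_{b} + X_{b} + D_{b}$; the counts in the usage example are hypothetical.

```python
def behavior_ratios(teacher: int, student: int, interaction: int) -> dict[str, float]:
    """Return the percentages J, X and D from raw behavior counts.

    Assumption: the total S_b is the sum of the three group counts.
    """
    if min(teacher, student, interaction) < 0:
        raise ValueError("counts must be non-negative")
    total = teacher + student + interaction
    if total == 0:
        raise ValueError("at least one behavior is required")
    return {
        "J": 100.0 * teacher / total,      # teacher behaviors, Eq. (3)
        "X": 100.0 * student / total,      # student behaviors, Eq. (4)
        "D": 100.0 * interaction / total,  # interaction behaviors, Eq. (5)
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(behavior_ratios(teacher=20, student=60, interaction=20))
```

Because the three groups share the same denominator, the returned percentages always sum to 100, which gives a quick sanity check on annotated data.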

(4): Temporal Analysis

Temporal analysis uses time-series methods to model the trends and periodicity of teaching behaviors over time [24]. In analyses of PE classrooms, time is plotted on the x-axis, and behaviors are plotted on the y-axis to visualize changes over time. This approach captures trends and relationships among sequential behaviors, illustrating their patterns and temporal distribution.
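One simple way to prepare data for such a plot is to bin annotated segments into fixed time windows. The sketch below assumes annotations of the form (start second, end second, behavior group) and a 60 s window; the segment format, the window length, and the function name are illustrative assumptions, not the system's actual data model.

```python
from collections import defaultdict


def behavior_timeline(segments, window_s=60):
    """Seconds of each behavior group that fall inside each time window.

    segments: iterable of (start_s, end_s, group) tuples.
    Returns {window_index: {group: seconds}} ordered by window index.
    """
    bins = defaultdict(lambda: defaultdict(float))
    for start, end, group in segments:
        if end <= start:
            raise ValueError("segment end must be after its start")
        t = start
        while t < end:
            window = int(t // window_s)
            # Stop at the segment end or the window boundary, whichever is first.
            edge = min(end, (window + 1) * window_s)
            bins[window][group] += edge - t
            t = edge
    return {w: dict(groups) for w, groups in sorted(bins.items())}


if __name__ == "__main__":
    # Hypothetical annotations: (start_s, end_s, group).
    demo = [(0, 90, "student"), (90, 120, "teacher")]
    print(behavior_timeline(demo))
```

Each window index maps directly to an x-axis position, and the per-group seconds can be stacked or drawn as separate tracks on the y-axis.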

4.1.5. Model Training

The publicly available Kinetics-400 dataset, which encompasses 400 action categories, was used to train the model [25]. From this dataset, 50 categories considered relevant to classroom teaching behaviors (e.g., “teacher demonstration” and “student interaction”) were selected. The training configuration comprised a batch size of 16, an initial learning rate of 0.001 (adjusted using cosine annealing), the Adam optimizer, and 50 epochs. Feature fusion techniques (e.g., weighted averaging of RGB and optical flow features) were applied to enhance classification accuracy. Data augmentation methods (e.g., random flipping and cropping) were used to improve the model’s generalizability. The model achieved 92% accuracy on the test set and 95% consistency with manual annotations (Table 2).
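The learning-rate schedule and the fusion step can be sketched in plain Python. The initial rate (0.001) and the 50 epochs come from the configuration above; the per-epoch update, the minimum rate of 0, the equal fusion weight, and the function names are assumptions, since the text does not specify them.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=50, lr_init=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    Assumptions: one update per epoch and lr_min = 0.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cos_term


def fuse_scores(rgb_scores, flow_scores, rgb_weight=0.5):
    """Weighted average of per-class scores from the RGB and optical-flow streams.

    Assumption: rgb_weight = 0.5; the actual weights are not reported.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    return [rgb_weight * r + (1.0 - rgb_weight) * f
            for r, f in zip(rgb_scores, flow_scores)]


if __name__ == "__main__":
    print([cosine_annealed_lr(e) for e in (0, 25, 50)])
    print(fuse_scores([0.8, 0.2], [0.4, 0.6]))  # hypothetical two-class scores
```

The schedule starts at the initial rate, reaches half of it at the midpoint, and decays smoothly to the minimum at the final epoch.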

4.2. Application of the CogVLM2-Video Model-Based System for Evaluating Teaching Behavior in PE Classrooms

The video lesson example of “Basketball Give-and-Go (Cut) Cooperation” at Level V (Grade 10 of senior high school) was imported into the system for data collection, analysis, and deconstruction. The following characteristics of this lesson were derived.

4.2.1. High Information Entropy, Low Redundancy: Stimulating Classroom Vitality and Eco-Constructivism

The entropy analysis gave an information entropy of 3.202 bits for this lesson, indicating considerable diversity in teaching behaviors. The teacher employed a variety of instructional strategies, thereby avoiding monotony and supporting student engagement and autonomy. The redundancy was 0.180, suggesting minimal repetitive behaviors, which enabled the teacher to manage classroom time efficiently and focus on critical knowledge points.
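The two reported values can be cross-checked against each other using the redundancy definition $r = 1 - H/H_{\max}$ and $H_{\max} = \log_{2} n$. The sketch below uses only the reported entropy and redundancy; the number of behavior types it implies is an inference, since the count is not stated in this section.

```python
import math


def implied_h_max(entropy_bits, redundancy):
    """Solve r = 1 - H / H_max for H_max."""
    if not 0.0 <= redundancy < 1.0:
        raise ValueError("redundancy must lie in [0, 1)")
    return entropy_bits / (1.0 - redundancy)


def implied_category_count(h_max_bits):
    """Invert H_max = log2(n) to get the (possibly fractional) n."""
    return 2.0 ** h_max_bits


if __name__ == "__main__":
    h_max = implied_h_max(3.202, 0.180)  # values reported for this lesson
    n = implied_category_count(h_max)
    print(round(h_max, 3), round(n, 2))
    # Redundancy recomputed under the inferred assumption of 15 types.
    print(round(1.0 - 3.202 / math.log2(15), 3))
```

The implied maximum entropy is about 3.905 bits, which corresponds to roughly 15 equally likely behavior types; recomputing the redundancy with 15 types reproduces the reported 0.180 after rounding, so the two figures are mutually consistent under that reading.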

4.2.2. From the Macro to the Micro: Comprehensive Deconstruction of the Classroom Structure

Figure 6 depicts the results of an analysis of the overall distribution of teaching behaviors within the classroom structure. The proportion of student behaviors was 61.18%, highlighting the student-centered nature of the lesson in alignment with modern pedagogical principles. The proportion of teacher behaviors was 17.72%, revealing the teacher’s role as an organizer and facilitator rather than as a traditional “knowledge provider”. This shift in roles effectively fostered students’ independent learning abilities. Interaction behaviors represented 21.10%, indicating frequent exchanges between teachers and students as well as among students. This dynamic and collaborative learning environment further enriched the classroom atmosphere.

The following main characteristics were identified from the specific class segments (see Table 3).

➀: The Teaching Principle of “Less Talking, More Practicing” Is Implemented, Emphasizing Circulatory Guidance and Evaluation During the Teaching Process

In this exemplary PE class, explanation and demonstration accounted for approximately 5% of the lesson (2 min), almost the same as guidance and evaluation during the teaching process (5%; 1.83 min). Transitions accounted for approximately 8% of the time (3.17 min). This indicates that the teacher allocated only limited time to collective pauses for explanation and demonstration and instead prioritized proactive, continuous roving guidance among students during the practice and learning phase. This gave the students ample time to practice and master their motor skills.

➁: Learning and Practicing Structured Motor Skills, with Students Guided to Apply What They Learned

In the motor skill learning and practice segment, combination movement exercises accounted for approximately 6% (2.5 min) of the teaching process, and demonstrations and competitions accounted for approximately 22% (8.5 min). These proportions indicate that the teacher was adept at guiding students to learn and practice structured motor skills. The teacher emphasized not only the practice of combination movements but also the application of skills through simulated game play and competitions, achieving a seamless transition from learning to application. This approach enabled students to deepen their understanding through practice and apply what they learned, thereby fulfilling the teaching goal.

➂: Harmonious Interactions Between Teachers and Students, Creating a Positive and Uplifting Classroom Atmosphere

Interaction behaviors accounted for 21% of the total time, with co-participation of teachers and students in competitions constituting the largest share (13%). This result highlights the teacher’s active participation, which fostered a collaborative and engaging classroom atmosphere. Co-participation not only enhanced enjoyment of the lesson but also facilitated real-time guidance, promoting mutual growth of teachers and students.

➃: High Exercise Density, Ensuring Sufficient, Effective, and Sustainable Exercise Time for Students

The physical activity density of the lesson, calculated at 85% (2020/2370 × 100%), exceeded the 75% threshold stipulated in the National Physical Education and Health Curriculum Standards (2017 Edition) of High Schools [4]. This result indicates that the lesson effectively maximized students’ physical activity within the available time, ensuring optimal skill and sustainable fitness development.
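The density figure can be reproduced with a short calculation. The sketch reads 2020 s as the time students spent exercising and 2370 s as the total lesson length; the reading of the numerator is an assumption, and the function names are hypothetical.

```python
def exercise_density(active_s: float, lesson_s: float) -> float:
    """Percentage of lesson time in which students are physically active."""
    if lesson_s <= 0 or not 0 <= active_s <= lesson_s:
        raise ValueError("need 0 <= active_s <= lesson_s and lesson_s > 0")
    return 100.0 * active_s / lesson_s


def meets_threshold(density_pct: float, threshold_pct: float = 75.0) -> bool:
    """Compare a density with the 75% threshold cited above."""
    return density_pct >= threshold_pct


if __name__ == "__main__":
    density = exercise_density(2020, 2370)
    print(round(density, 1), meets_threshold(density))
```

The unrounded value is about 85.2%, which the text reports as 85%.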

➄: Low Level of Questioning and Limited Learning Depth

Questioning, which accounted for only 3% of the lesson, comprised mainly closed-ended questions with expected responses and general feedback. This limited focus on open-ended questioning reduced opportunities for students to engage in deeper thinking. To address this issue, the teacher could incorporate open-ended, contextual questions to stimulate creative and critical thinking. One example is: “How can you adapt the pass-and-cut strategy in a real game scenario?”

➅: Low Level of Application of Modern Information Technology and the Need to Improve Information Literacy

The teacher did not employ any modern information technology tools during the lesson. This finding reveals a gap in technology integration and suggests a need for professional development in information literacy. Schools could offer targeted training on integrating technologies such as wearable devices, video analysis tools, and online feedback platforms to enrich teaching methods and meet diverse student needs.

4.2.3. From Point to Area, Precisely Depicting the Teaching Process Framework

The trends in changing teaching behaviors during this lesson (see Figure 7) indicate that student behaviors were frequent and sustained across multiple time periods, whereas teacher and interaction behaviors were more dispersed and less frequent. This finding highlights the teacher’s role as a facilitator and guide, with students taking the lead in driving classroom progress. This teaching behavior trend can be subdivided into three distinct phases corresponding to successive phases in the lesson’s progression.

Preparation Phase (0–360 s). During the preparation phase, learning and practice activities could be categorized as two behavioral chains: 1-5 (9.1) and 1-4-5 (9.1)-2. These chains indicate that students participated in both general warm-up exercises and targeted preparatory activities, which laid a solid foundation for their subsequent learning and practice stages. The teacher provided explanations and demonstrations while conducting circulatory guidance and evaluation during students’ execution of specific preparatory exercises. The teacher also incorporated closed-ended questioning to ensure that each student received timely support and feedback.

Core Phase (360–2260 s). The skill learning and practice segment comprised four behavioral chains: 1-10-4-6.2 (9.1, 10)-2, 1-4-6.3 (9.1, 11), 1-4-6.3 (9.1)-2, and 1-4-6.3 (11). These chains demonstrate that the teacher designed four learning activities, each of which included detailed explanation and demonstration, smooth transitions, and targeted guidance and evaluation. During students’ group practice or competitive gameplay, diverse interactive activities were integrated, including questioning, collaborative practice, and teacher–student co-participation in games, significantly enhancing classroom interactivity and student engagement. The physical fitness segment included two behavioral chains: 1-4-12-7 and 4-7 (9.1)-4-7 (9.1)-4-7 (9.1). These chains reflected the teacher’s emphasis on creating autonomous, collaborative learning spaces, prioritizing transitional adjustments in group cyclic exercises to ensure efficiency and seamless practice. The teacher monitored the students closely, posed timely questions, and promoted reflection to improve physical fitness outcomes.

Concluding Phase (2260–2370 s). Relaxation activities followed a behavioral chain of 4-8-2. This chain indicates that after completing the physical fitness exercises, the teacher facilitated a smooth transition to relaxation activities. The session concluded with concise feedback and evaluation to consolidate students’ learning outcomes and promote their sustained progress and self-improvement.
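The relative weight of the three phases follows directly from the boundaries given above. The sketch below only restates those boundaries and derives each phase's duration and share of the 2370 s lesson; the variable and function names are hypothetical.

```python
# Phase boundaries in seconds, as stated in the text.
PHASES = [
    ("preparation", 0, 360),
    ("core", 360, 2260),
    ("concluding", 2260, 2370),
]


def phase_shares(phases):
    """Return {name: (duration_s, share_pct)} for consecutive lesson phases."""
    total = phases[-1][2] - phases[0][1]
    if total <= 0:
        raise ValueError("phases must span a positive duration")
    return {
        name: (end - start, round(100.0 * (end - start) / total, 1))
        for name, start, end in phases
    }


if __name__ == "__main__":
    for name, (duration, share) in phase_shares(PHASES).items():
        print(f"{name}: {duration} s ({share}%)")
```

The core phase thus occupies roughly four fifths of the lesson (1900 s, about 80.2%), with about 15.2% spent on preparation and 4.6% on the concluding relaxation and feedback.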

5. Discussion

5.1. Advancing Evaluations of Teaching Behaviors in PE Classrooms with Intelligent Technology to Achieve Automation and Precision in Behavioral Analysis

An intelligent, CogVLM2-Video model-based system for evaluating teaching behaviors in PE classrooms was designed and implemented in this study. The system automatically captures, recognizes, and analyzes teacher and student behaviors and their interaction behaviors in PE classrooms and generates detailed reports to assist teachers in optimizing their instructional strategies. Its contributions relating to technological innovation, effective teaching evaluation, and practical applicability are thus significant.

5.1.1. Microservice Architecture and Dual-Layer Database: Novel Integration of Software and Database Design

The system employs the Spring Cloud microservice architecture, dividing functionalities into independent service units to achieve enhanced scalability and maintainability [26]. A dual-layer database structure combined with data backup and recovery mechanisms ensures data security, reliability, and integrity. Inspired by the layered architecture concepts of Buede and Miller, this design supports the system’s clarity and stability [27]. Compared with previously reported systems, such as a virtual English teaching system [24] and an online education platform [28], this system supports not only small-scale classrooms but also large-scale PE classrooms, using high-resolution cameras to capture dynamic and complex motion scenarios. Furthermore, its algorithmic fusion and visualization capabilities, along with the use of methods such as temporal and entropy analysis, enable precise evaluation of PE classroom teaching. Thus, it provides robust support for optimizing teaching strategies and offers clear advantages in both technical implementation and educational application.

5.1.2. Behavior Annotation Technology Based on the CogVLM2-Video Model: A New Approach for Automatically Capturing and Classifying Teaching Behaviors

With the advancement of sensor technology, intelligent wearable devices have been widely adopted for evaluating teaching behaviors. For instance, Prieto et al. used eye-tracking devices combined with machine learning to predict teacher activities [29], and Ma and Yang evaluated student engagement using facial recognition technology [30]. However, these studies were primarily conducted in smart classroom environments and focused on cognitive behaviors, with insufficient attention given to the dynamic and sustained teaching behaviors that are unique to PE classrooms. By incorporating the CogVLM2-Video model, this system overcomes the limitations of traditional sensors, which primarily collect cognitive data [31]. The automated annotation technology enables the system to capture and classify teaching behaviors automatically, identifying not only teacher and interaction behaviors but also cognitive and physical practices during student learning. It thus fills an existing research gap and meets the unique evaluation needs of PE classrooms, significantly enhancing the accuracy and comprehensiveness of evaluations of teaching behaviors.

5.1.3. Multi-Algorithm Fusion for Intelligent Analysis: A New Paradigm for Quantifying Teaching Behaviors in PE Classrooms

Leveraging the high-precision CogVLM2-Video model and deep learning technologies, the system integrates multiple algorithms tailored to the characteristics of PE classrooms, including entropy, redundancy, ratio, and temporal analyses. These algorithms provide a multifaceted approach for uncovering teaching behavior patterns and deconstructing classroom structures. Compared with traditional in-person observations and qualitative assessments, entropy analysis offers more precise quantification, addressing the limitations of accuracy and depth of qualitative methods, and revealing the complexity and diversity of teaching processes [32]. Redundancy analysis evaluates the effectiveness of repetitive teaching strategies, helping teachers to optimize appropriate instruction, prevent over-teaching that could prompt attention loss, and allocate teaching time more scientifically to enhance classroom interactions and learning efficiency [33]. Ratio analysis quantifies the proportions of teacher, student, and interaction behaviors, visually demonstrating structural classroom models (lecture-oriented, practice-oriented, or dialogic). Lastly, by dynamically monitoring behavioral changes over time, temporal analysis reveals behavioral characteristics and distributions, thereby improving the management of PE classroom teaching. These multidimensional analyses offer robust support for accurate evaluation and optimization of PE teaching behaviors.

5.1.4. A New Model for Precise Teaching Feedback Offering a Fully Automated Data Collection, Analysis, and Reporting Platform

The system achieves a fully automated “intelligent data collection, intelligent analysis, intelligent reporting” workflow, significantly reducing the need for manual intervention. High-speed cameras continuously record classroom activities, eliminating the need for wearable devices and making the system more suitable for large-scale, group-oriented PE classrooms than systems that rely on inertial measurement unit (IMU) sensors [34]. Intelligent analysis powered by the CogVLM2-Video model captures complex PE teaching behaviors, enabling fine-grained analysis that reduces misclassification and omission errors [35]. The reporting functionality visualizes the results of analyses, making data interpretation more efficient and accurate and providing objective feedback that improves the scientific validity of classroom evaluations.

5.2. Upholding Ethical Standards and Data Security While Enhancing PE Teachers’ Digital Literacy

Technology is a double-edged sword, presenting both unprecedented opportunities and risks through the integration of AI into systems for evaluating teaching behaviors in PE classrooms. Classroom interactions have been gradually quantified through increasing collection, analysis, integration, and sharing of data on teaching behaviors, making the teaching process and student performance more observable and measurable. While this transformation facilitates evaluations of teaching behaviors and instructional improvement, it also raises concerns about potential information leakage and privacy infringements, posing challenges to personal data security [36]. Prioritizing information security and privacy protection during the collection, analysis, and application of data on teaching behaviors is imperative. This requires strict adherence to relevant legal regulations and ethical guidelines to ensure the legitimacy and propriety of data handling activities [37].

In this study, high-resolution video recordings that captured facial expressions, body movements, and interaction details were used for behavioral analysis [38,39]. Accordingly, the system complies with stringent data protection regulations. Advanced encryption technologies, secure cloud storage, and routine security audits are implemented to safeguard data against unauthorized access, leaks, and cyberattacks [40]. Given the CogVLM2-Video model’s reliance on extensive training data, it is critical to ensure data neutrality to avoid biases that could result in discriminatory outcomes. This means addressing issues of fairness, accountability, and transparency in algorithm design and application so that the system meets ethical requirements [41].

Ensuring the digital literacy of PE teachers is also essential. According to the global initiative of the Institute of Electrical and Electronics Engineers (IEEE), “All technical professionals must be educated, trained, and certified to prioritize ethical considerations in the design, development, and application of intelligent systems” [42]. As users of this system, PE teachers need to acquire proficiency in interpreting and applying AI-generated data while leveraging their domain expertise. Targeted training programs are crucial to enabling teachers to gain a comprehensive understanding of the strengths and limitations of AI-based evaluation systems and of related ethical considerations. Teachers need to develop critical skills for evaluating AI insights judiciously, ensuring that instructional decisions are effective and aligned with their students’ best interests.

6. Conclusions

An intelligent CogVLM2-Video model-based system for evaluating teaching behaviors in PE classrooms was designed and implemented in this study. The system demonstrates the capability of accurately capturing and analyzing teacher, student, and interaction behaviors, providing comprehensive and objective data to support classroom evaluation and optimize instructional strategies. Theoretically, the study innovatively integrated AI with quantitative behavior analysis, proposing a multi-algorithm fusion evaluation framework specifically tailored to the unique dynamics of PE classrooms. This framework represents a significant advance in the application of AI in the education field by addressing the complex and interactive nature of PE teaching scenarios.

Practically, the system has proven its utility in actual classroom applications, exhibiting high recognition accuracy and real-time performance. By offering reliable data-driven insights, it supports effective monitoring of teaching quality and facilitates educators’ professional development. These contributions reveal its potential to transform traditional methods for evaluating teaching behaviors by promoting data-informed strategies to enhance classroom effectiveness.

Despite these achievements, the system has certain limitations that warrant further exploration. High data annotation costs, difficulties in recognizing behaviors within complex scenarios, and concerns surrounding privacy protection remain ongoing challenges. To address them, future research will focus on optimizing the algorithmic structure by incorporating lightweight deep learning models, such as MobileNet, and leveraging transfer learning techniques to reduce annotation costs and computational demands. Although CogVLM2-Video is a vision-language model, this study used only its visual and temporal behavior recognition functions. The language alignment module (e.g., textual prompts or video-based Q&A) was not activated. However, its architecture supports future integration of these capabilities, which could improve interpretability by linking actions with instructional discourse or verbal cues. Furthermore, the case-based approach used here, while validating feasibility, limits generalizability. Future research should apply the system across varied teaching scenarios—including different sports, age groups, and instructional settings—to improve robustness and adaptability. Training with domain-specific data, such as motor skill instruction or game-based learning, could also optimize system performance in diverse PE environments.

In sum, this study provides a novel, data-driven approach for improving the quality of PE classroom teaching. By enhancing the scientific rigor and precision of teaching evaluations, the system lays a robust foundation for the dynamic analysis and comprehensive understanding of multidimensional classroom behaviors. These advancements not only optimize teaching evaluation methodologies but also support real-time monitoring of behaviors, fostering a deeper understanding of the interactions and dynamics within educational environments.

Author Contributions

Conceptualization, funding acquisition, methodology, project administration, resources, writing – original draft, writing – review and editing, C.L.; formal analysis, investigation, methodology, writing – original draft, F.Y.; data curation, methodology, software, writing – original draft, C.G.; methodology, project administration, resources, software, writing – review and editing, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Humanities and Social Sciences Research Youth Project of the Ministry of Education of China (grant number 22YJC890013) as well as the Youth Project of Jiangsu Province Social Science of China (grant number 22TYC003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Centra, J.A.; Potter, D.A. School and teacher effects: An interrelational model. Rev. Educ. Res. 1980, 50, 273–291. [Google Scholar] [CrossRef]
Zhou, T.; Wu, X.; Wang, Y.; Wang, Y.; Zhang, S. Application of artificial intelligence in physical education: A systematic review. Educ. Inf. Technol. 2024, 29, 8203–8220. [Google Scholar] [CrossRef]
Górriz, J.M.; Ramírez, J.; Ortíz, A.; Martínez-Murcia, F.J.; Segovia, F.; Suckling, J.; Leming, M.; Zhang, Y.-D.; Álvarez-Sánchez, J.R.; Bologna, G.; et al. Artificial intelligence within the interplay between natural and artificial computation: Advances in data science, trends and applications. Neurocomputing 2020, 410, 237–270. [Google Scholar] [CrossRef]
Ji, L. Interpretation of National Physical Education and Health Curriculum Standards (2017 Edition) of High Schools in China; China Sport Science: Beijing, China, 2017; Volume 38, pp. 3–20. [Google Scholar] [CrossRef]
Cheffers, J.T.F. Cheffer’s adaptation of the Flanders’ interaction analysis system (CAFIAS). In Systematic Observation Instrumentation for Physical Education; Darst, P.W., Zakrajsek, D., Mancini, V.H., Eds.; L-Eisure Press: New York, NY, USA, 1983; pp. 76–96. [Google Scholar]
Stewart, M. Observational recording record of physical educator’s teaching behavior (ORRPETB). In Analyzing Physical Education and Sport Instruction; Darst, P.W., Zakrajsek, D., Mancini, V.H., Eds.; Human Kinetics: Champaign, IL, USA, 1989; pp. 249–259. [Google Scholar]
Quested, E.; Ntoumanis, N.; Stenling, A.; Thogersen-Ntoumani, C.; Hancox, J.E. The need-relevant instructor behaviors scale: Development and initial validation. J. Sport Exerc. Psychol. 2018, 40, 259–268. [Google Scholar] [CrossRef]
Liu, C.; Dong, C.; Ji, L. Analysis of the characteristics and influencing factors of PE classroom teaching under the Chinese health physical education curriculum model. J. Tianjin Univ. Sport 2023, 38, 289–295. [Google Scholar] [CrossRef]
Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Li, F.-F. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1725–1732. [Google Scholar] [CrossRef]
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of the 14th ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016; pp. 20–36. [Google Scholar] [CrossRef]
Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 588–597. [Google Scholar] [CrossRef]
Tu, Q.; Zhao, X.; Gong, D.; Zhang, Q. Improved ECA-ResTCN for Online Classroom Student Attention Recognition. Teh. Vjesn. 2024, 31, 832–836. [Google Scholar] [CrossRef]
Jia, Q.; He, J. Student Behavior Recognition in Classroom Based on Deep Learning. Appl. Sci. 2024, 14, 7981. [Google Scholar] [CrossRef]
Han, L.; Ma, X.; Dai, M.; Bai, L. A WAD-YOLOv8-based method for classroom student behavior detection. Sci. Rep. 2025, 15, 9655. [Google Scholar] [CrossRef]
Wang, Q. Research on Student Movement Behavior Recognition Based on 3D-CNN Algorithm. In Proceedings of the 2024 IEEE 2nd International Conference on Image Processing and Computer Applications (ICIPCA), Shenyang, China, 28–30 June 2024; pp. 791–1795. [Google Scholar] [CrossRef]
Zheng, Q.; Chen, Z.; Wang, M.; Shi, Y.; Chen, S.; Liu, Z. Automated Multimode Teaching Behavior Analysis: A Pipeline-Based Event Segmentation and Description. IEEE Trans. Learn. Technol. 2024, 17, 1677–1693. [Google Scholar] [CrossRef]
Abedi, A.; Khan, S.S. Improving state-of-the-art in Detecting Student Engagement with Resnet and TCN Hybrid Network. In Proceedings of the 2021 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021; pp. 151–157. [Google Scholar] [CrossRef]
Hong, W.; Wang, W.; Ding, M.; Yu, W.; Lv, Q.; Wang, Y.; Cheng, Y.; Huang, S.; Ji, J.; Xue, Z.; et al. Cogvlm2: Visual language models for image and video understanding. arXiv 2024. [Google Scholar] [CrossRef]
Peng, C.; Zhang, K.; Lyu, M.; Liu, H.; Sun, L.; Wu, Y. Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning. arXiv 2025. [Google Scholar] [CrossRef]
Qu, C.; Wang, J. Application and research of ASP/ADO technology in the development of distance teaching systems. J. Nantong Univ. Nat. Sci. Ed. 2003, 82–85. Available online: https://kns.cnki.net/kcms2/article/abstract?v=bTgd32KJj6u_1XGHAf8bkr5ldxyRbIpcLek46yzSACHst6tEmUVkFfuaDGXT9pqlczdL5GFJqvsRL82FvVPws55aeOSdL_BGESldySycmBDx5xHv7uU2taB4bxm2TDQdjbA2NPeW3JsWm0s9hT9Rp4rJmuCRQTXpH9qlxYlQ6qgmAjGk4wncwlwa0rDR2X-C&uniplatform=NZKPT&language=CHS (accessed on 6 July 2025). (In Chinese).
Liu, C.; Dong, C.; Li, X.; Huang, H.; Wang, Q. Analysis of Physical Education Classroom Teaching after Implementation of the Chinese Health Physical Education Curriculum Model: A Video-Based Assessment. Behav. Sci. 2023, 13, 251.
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
González-Peño, A.; Franco, E.; Coterón, J. Do observed teaching behaviors relate to students’ engagement in physical education? Int. J. Environ. Res. Public Health 2021, 18, 2234.
Jin, G.; He, L.; Tsai, S.B. An Empirical Study on Virtual English Teaching System Based on the Microservice Architecture with Wireless Internet Sensor Network. Math. Probl. Eng. 2021, 2021, 8494410.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017.
Blinowski, G.; Ojdowska, A.; Przybyłek, A. Monolithic vs. microservice architecture: A performance and scalability evaluation. IEEE Access 2022, 10, 20357–20374.
Buede, D.M.; Miller, W.D. The Engineering Design of Systems: Models and Methods; John Wiley & Sons: Hoboken, NJ, USA, 2024; p. 220.
Miao, K.; Li, J.; Hong, W.; Chen, M. A Microservice-Based Big Data Analysis Platform for Online Educational Applications. Sci. Program. 2020, 2020, 6929750.
Prieto, L.P.; Sharma, K.; Dillenbourg, P.; Jesús, M. Teaching analytics: Towards automatic extraction of orchestration graphs using wearable sensors. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK '16), Edinburgh, UK, 25–29 April 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 148–157.
Ma, C.; Yang, P. Research on classroom teaching behavior analysis and evaluation system based on deep learning face recognition technology. J. Phys. Conf. Ser. 2021, 1992, 032040.
Almusawi, H.A.; Durugbo, C.M.; Bugawa, A.M. Innovation in physical education: Teachers’ perspectives on readiness for wearable technology integration. Comput. Educ. 2021, 167, 104185.
Li, Z.; Su, H.; Jiang, C.; Han, J. Machine Learning-Enhanced ORB Matching Using EfficientPS for Error Reduction. Appl. Math. Nonlinear Sci. 2024, 9, 1–15.
Rone, N.; Guao, N.A.; Jariol, M., Jr.; Acedillo, N. Students’ Lack of Interest, Motivation in Learning, and Classroom Participation: How to Motivate Them? Psychol. Educ. Multidiscip. J. 2023, 7, 636–646.
Lander, N.; Nahavandi, D.; Toomey, N.G.; Barnett, L.M.; Mohamed, S. Accuracy vs. practicality of inertial measurement unit sensors to evaluate motor competence in children. Front. Sports Act. Living 2022, 4, 917340.
Zhang, T.; Wu, Y.; Li, X. Dilated Multi-Temporal Modeling for Action Recognition. Appl. Sci. 2023, 13, 6934.
Lupton, D.; Williamson, B. The datafied child: The dataveillance of children and implications for their rights. New Media Soc. 2017, 19, 780–794.
Floridi, L. Establishing the rules for building trustworthy AI. Ethics Gov. Policies Artif. Intell. 2021, 144, 41–45.
Standing Committee of the National People’s Congress of the People’s Republic of China. Personal Information Protection Law of the People’s Republic of China; National People’s Congress Standing Committee: Beijing, China, 2021.
European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC. Off. J. Eur. Union 2016, L119, 1–88. Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj (accessed on 6 July 2025).
Wiefling, S.; Tolsdorf, J.; Iacono, L.L. Privacy considerations for risk-based authentication systems. In Proceedings of the 2021 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Vienna, Austria, 6–10 September 2021; pp. 320–327.
Akinrinola, O.; Okoye, C.C.; Ofodile, O.C.; Ugochukwu, C.E. Navigating and reviewing ethical dilemmas in AI development: Strategies for transparency, fairness, and accountability. GSC Adv. Res. Rev. 2024, 18, 50–58.
Jedličková, A. Ethical approaches in designing autonomous and intelligent systems: A comprehensive survey towards responsible development. AI Soc. 2024, 40, 2703–2716.

Figure 1. Design of a scalable teaching behavior evaluation system for PE classrooms using CogVLM2-Video.

Figure 2. Implementation principles of the system for evaluating teaching behavior in PE classrooms.

Figure 3. Implementation framework of the system for evaluating teaching behaviors in PE classrooms.

Figure 4. System for evaluating teaching behaviors in PE classrooms.

Figure 5. Teaching behavior identification process using CogVLM2-Video.

Figure 6. Ratio analysis of classroom behaviors.

Figure 7. Temporal analysis of teaching behaviors.

Table 1. Results of the simulation experiments to validate the evaluation system’s performance.

Experimental Metric	Description	Result Value	Explanation
Video Data Volume	Number of basketball classroom teaching videos collected and pre-processed	50	Collected via the perception layer, including videos with various teaching behaviors
Number of Action Categories	Categories of teaching behavior actions	12	Includes “teacher demonstration,” “student practice,” “interactive Q&A,” etc.
Data Pre-Processing Time	Average pre-processing time per video	15 s	Includes denoising, segmentation, and normalization processes
Action Recognition Accuracy	Accuracy of action recognition based on the CogVLM2-Video model	90%	Classification accuracy of teaching behaviors by the model
Annotation Consistency	Consistency between model results and manual annotations	95%	Manual annotation results were used as a reference
Data Transmission Latency	Average latency from data collection to visualization	1.5 s	Includes the entire process of collection, pre-processing, analysis, and front-end display
Overall System Evaluation	User satisfaction score (1–5 scale)	4.8	Based on feedback from 10 participating teachers

Table 2. Results of the model performance evaluation.

Metric	Description	Result
Action Categories	Total number of action categories in the test set	50
Test Videos	Total number of videos in the test set	200
Action Recognition Accuracy	Correct classification rate of action categories	92%
Annotation Consistency	Consistency between model predictions and manual annotations	95%
Inference Speed	Time to process each frame of video data	25 ms
Recall	Proportion of correctly identified positive samples	90%
Precision	Proportion of true positives among predicted positives	93%
F1 Score	Harmonic mean of precision and recall	91.5%
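
The F1 score in Table 2 can be checked against the reported precision and recall. The sketch below applies the standard harmonic-mean definition to the tabulated values (93% precision, 90% recall); the function name `f1_score` is illustrative and is not taken from the authors' implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when the model has no true positives.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 2.
f1 = f1_score(precision=0.93, recall=0.90)
print(f"F1 = {f1:.1%}")  # prints "F1 = 91.5%"
```

The result agrees with the tabulated F1 score to one decimal place.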

Table 3. Statistical results for teaching behaviors in the case study.

Category	Proportion (%)	Frequency	Duration (s)
Explanation and Demonstration	5	12	120
Guidance and Evaluation	5	11	110
Technology Use	0	0	0
Transitions	8	19	190
Warm-Up Exercises	10	24	240
Single Technique Practice	0	0	0
Combined Technique Practice	6	15	150
Demonstration and Competition	22	51	510
Fitness Training	20	48	480
Relaxation Exercises	3	7	70
Closed Questions—Expected Responses—General Feedback	3	8	80
Open Questions—Interpretative Responses—Professional Feedback	0	0	0
Collaborative Learning	3	8	80
Teacher-Student Competitions	13	30	300
Peer Discussion and Evaluation	2	4	40
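
The three numeric columns of Table 3 are mutually consistent: every duration equals ten times the frequency, and each proportion is the frequency divided by the total count, rounded to a whole percent. The sketch below reproduces those columns from the frequencies alone. It assumes a fixed 10 s coding interval (inferred from the table, not stated in it). It also computes the Shannon entropy of the behavior distribution in bits; whether the authors use exactly this formula for their entropy analysis is an assumption.

```python
import math

# Frequencies copied from Table 3 (number of coded intervals per category).
frequencies = {
    "Explanation and Demonstration": 12,
    "Guidance and Evaluation": 11,
    "Technology Use": 0,
    "Transitions": 19,
    "Warm-Up Exercises": 24,
    "Single Technique Practice": 0,
    "Combined Technique Practice": 15,
    "Demonstration and Competition": 51,
    "Fitness Training": 48,
    "Relaxation Exercises": 7,
    "Closed Questions": 8,   # "...Expected Responses, General Feedback"
    "Open Questions": 0,     # "...Interpretative Responses, Professional Feedback"
    "Collaborative Learning": 8,
    "Teacher-Student Competitions": 30,
    "Peer Discussion and Evaluation": 4,
}

INTERVAL_S = 10  # assumed length of one coding interval, in seconds

total = sum(frequencies.values())
proportions = {name: count / total for name, count in frequencies.items()}
durations = {name: count * INTERVAL_S for name, count in frequencies.items()}


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


entropy = shannon_entropy(proportions.values())
max_entropy = math.log2(len(frequencies))  # upper bound: uniform distribution

for name in frequencies:
    print(f"{name:32s} {proportions[name]:5.0%} {durations[name]:5d} s")
print(f"Entropy: {entropy:.2f} of at most {max_entropy:.2f} bits")
```

Comparing the entropy with its upper bound gives a rough measure of how evenly class time is spread across behavior categories.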

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

