Article

Deep Learning for Student Behavior Detection in Smart Classroom Environments

1 School of Design Engineering, Wuhan Qingchuan University, Wuhan 430204, China
2 School of Teaching, Learning and Curriculum Studies, Kent State University, 800 E Summit St, Kent, OH 44240, USA
3 School of Computer Science, South-Central Minzu University, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 949; https://doi.org/10.3390/info16110949
Submission received: 23 September 2025 / Revised: 16 October 2025 / Accepted: 27 October 2025 / Published: 3 November 2025

Abstract

The ongoing integration of information technology in education has made the monitoring of student behavior in smart classrooms essential for improving teaching quality and student engagement. Classroom environments, however, present numerous challenges, including heterogeneous student behaviors, heavy occlusion, loss of fine detail, and difficulty in recognizing small targets, so current approaches remain inadequate in accuracy and stability. This paper enhances YOLOv11 with the following improvements: a CSP-PMSA module that strengthens contextual modeling in complex backgrounds, a scale-aware head (SAH) that improves the perception and localization of small targets through channel unification and scale adaptation, and a Multi-Head Self-Attention (MHSA) mechanism that models global dependencies and positional bias across multiple subspaces, thereby sharpening the discrimination of visually similar behaviors. Experimental results show that, in complex classroom settings, the model attains mAP@50 and mAP@50–95 scores of 91.6% and 75.7%, respectively, improvements of 2.7% and 2.6% over YOLOv11 and 4.6% and 3.6% over DETR, demonstrating strong detection precision and reliability. Additionally, the model was deployed on the Jetson Orin Nano platform, confirming its viability for real-time detection on edge devices and offering substantial support for practical implementation in smart classrooms.

Graphical Abstract

1. Introduction

The intelligent transformation of education has become a pivotal trend for future development [1]. Artificial intelligence technology makes it possible to automatically identify and analyze student classroom behavior, which deepens understanding of students' learning states and classroom performance and provides teachers with objective, real-time data support; together, these capabilities help optimize pedagogical practice, raising instructional efficiency and classroom management effectiveness [2,3]. Conventional approaches rely mainly on human observation and documentation, which are vulnerable to subjective bias, prone to fatigue, and time-consuming, and are therefore difficult to apply for large-scale, long-term classroom monitoring [4,5]. There is now a pressing need for a more effective and sophisticated technological approach to support classroom behavior analysis.
With the rapid progress of big data and deep learning, artificial intelligence has achieved notable success in areas such as computer vision, medical imaging, and security surveillance, and is progressively permeating the education sector [6]. Li et al. [7] utilized convolutional neural networks (CNNs) to recognize and assess classroom teaching behaviors, significantly improving the precision of student behavior detection for the purpose of facilitating classroom interaction analysis. Several researchers have sought to integrate temporal or spatio–temporal modeling methodologies to identify students' learning states. The IMRMB-Net model introduced by Feng et al. [8] attains a balance between occlusion management, small object behavior identification, and computational efficiency in intricate classroom environments. Furthermore, investigations have examined lightweight detection networks tailored for resource-limited smart classroom settings [8]. For instance, Han et al. [9] augmented multi-scale and fine-grained modeling capabilities via CA-C2f and 2DPE-MHA modules, complemented by dynamic sampling strategies. Meanwhile, Chen et al. enhanced the YOLOv8 methodology, which markedly bolstered robustness in occlusion and small-scale behavior detection [10].
This research has laid a foundation for behavior recognition in educational contexts; nonetheless, several problems remain in real classroom settings. In a typical classroom, the large number of students, considerable scale disparities between the front and rear rows, and frequent occlusions significantly diminish detection accuracy. At the same time, student behaviors are highly concentrated, consisting primarily of static postures or visual cues such as "listening attentively," "looking down," or "standing up," which places greater demands on models to discern subtle distinctions. Although current research has improved accuracy, most methods rely on intricate network structures or incur substantial computational costs, making them difficult to integrate into everyday teaching practice.
This work presents an enhanced object identification model derived from YOLOv11, with primary contributions including the following:
(1) Developed the CSP-PMSA module, integrating Cross Stage Partial connections with partial multi-scale convolutions to adaptively strengthen the representation of important information, thereby significantly enhancing contextual modeling capability in complex backgrounds;
(2) Developed the Small Object Aware Head (SAH), which markedly improves the model’s detection efficacy for diminutive objects and nuanced behaviors via a cohesive channel and scale-adaptive mechanism;
(3) Implemented the Multi-Head Self-Attention (MHSA) mechanism prior to the detection head to capture global dependencies across various subspaces, thereby enhancing the model’s capacity to differentiate visually analogous behavioral categories and its overall resilience.
This paper is organized in the following way: Section 2 gives a brief overview of related research; Section 3 goes into more detail about the proposed model methodology; Section 4 introduces the dataset and evaluation metrics, showing how the experiment was set up, the results, and the analysis; and Section 5 wraps up the whole paper and suggests future research directions.

2. Related Work

2.1. Object Detection

Object detection, a vital job in computer vision, has shown swift progress in recent years. Conventional two-stage methodologies, such as R-CNN [11], Fast R-CNN [12], and Faster R-CNN [13], integrate candidate region generation with classifier recognition to attain elevated detection accuracy, albeit resulting in diminished inference speeds. Subsequent single-stage detection methodologies, such as SSD [14] and the YOLO series [15], redefined detection as a regression task, markedly improving efficiency and facilitating end-to-end training. The YOLO series has evolved through successive iterations, from YOLOv3 [16] to YOLOv5, YOLOv7, YOLOv8, and the most recent YOLOv11, by systematically incorporating residual structures, CSP (Cross Stage Partial) modules, attention mechanisms, and feature fusion networks to improve feature representation and cross-scale modeling capabilities. Simultaneously, academics have proposed numerous improvements to address small object recognition, feature loss, and category ambiguity in complex situations. This encompasses the integration of feature pyramids (FPN [17], PANet [18]), multi-scale convolutions, attention mechanisms, and cross stage information flow. Although these developments establish a solid technical basis for behavior identification in particular applications, additional optimization is possible in complex situations.

2.2. Behavior Detection in Classroom Scenarios

In educational settings, the detection and analysis of student classroom behavior serve as vital means for understanding learning states, optimizing teaching strategies, and enhancing classroom quality. Initial investigations predominantly depended on manual observation and statistical techniques, such as the Flanders Interaction Analysis System (FIAS) [19] and the enhanced Flanders Interaction Analysis System (iTIAS) [20]. Although effective in small-scale classrooms, these systems exhibit low efficiency due to dependence on manual annotation and coding, hence complicating large-scale, automated, and high-precision classroom behavior recognition.
As artificial intelligence and deep learning technologies evolve, researchers increasingly aim to implement computer vision techniques for recognizing classroom behavior. Current methodologies can be classified into two primary categories: those utilizing conventional features and machine learning classifiers, and those employing end-to-end frameworks based on deep learning. The former generally depend on handcrafted features, showing restricted generalization abilities; conversely, the latter have superior benefits in behavior identification because of their strong feature extraction capabilities and resilience. Presently, leading research primarily utilizes deep learning architectures, including dual-stream convolutional neural networks, three-dimensional convolutional neural networks (3D-CNNs), and long short-term memory (LSTM) networks [21,22,23], yielding positive outcomes in video behavior recognition tasks. Nevertheless, these methodologies generally depend on lengthy video sequences, with detection outcomes frequently limited to individual behavioral categories. Their substantial training overhead and high computational costs constrain practical deployment within smart classroom environments.
In recent years, the swift progress of object detection techniques has led researchers to implement single-stage detection algorithms, like YOLO and SSD, for classroom behavior recognition tasks, facilitating the automatic identification of various behavioral categories. In this context, a range of enhanced methodologies has arisen, chiefly suggesting optimization ways to tackle issues such as intricate backdrops, diminutive objects, occlusions, and category resemblance. Wang et al. [24] integrated Deformable DETR with the Swin Transformer and a streamlined feature pyramid architecture to augment cross-scale feature modeling capabilities, thereby enhancing behavior detection efficacy in intricate backgrounds; Peng et al. [25] introduced YOLO-CBD, which markedly advanced small object detection in dense and occluded environments by reconfiguring the backbone network and implementing a multi-scale adaptive mechanism; Zhu et al. [26] developed CSB-YOLO, which improves real-time performance and stability in classroom behavior detection through network architecture optimization and feature fusion, achieving commendable results across various student behavior categories; Chen and Guan [27] incorporated an embedded connection component in the YOLOv4 detector head and utilized a Repulsion loss function, consisting of RepGT and RepBox, significantly decreasing false positives and false negatives, thus enhancing intelligent classroom behavior recognition; Zhang et al. [28] introduced the enhanced YOLO-CBAM algorithm based on YOLOv3, further elevating detection accuracy by integrating generalized intersection-over-union and focus loss.
Current research has significantly advanced in the domain of classroom behavior identification. Nonetheless, obstacles remain in actual classroom environments, such as issues in identifying small items, significant occlusions, and considerable similarities within behavioral categories. Consequently, the robustness and generalization capabilities of models require further enhancement.

3. Materials and Methods

3.1. Dataset Production

Sample quality and relevance are critical for effective deep learning model training. Informed by the literature review, this paper adopts the STBD-08 dataset as its research subject. The dataset is derived from 131 high-quality, ministerial-level teaching videos from China's National Resource Public Service Platform; segments were produced through automatic editing and then categorized and labeled by behavior. STBD-08 covers eight common student classroom behaviors: writing, reading, listening, standing, discussing, guiding, turning around, and raising a hand. It comprises 8844 images containing a total of 267,888 behavioral bounding boxes. The dataset was split into training, validation, and test sets at an 8:1:1 ratio to support rigorous model training and evaluation.
Figure 1 shows that the distribution of behavioral samples across categories is highly uneven: listening, writing, and reading account for the majority of labels, whereas behaviors such as guiding and standing are comparatively rare, revealing a clear class imbalance. Figure 2 shows the distribution of relative bounding-box sizes in the images: most boxes have relative widths between 0.05 and 0.2 and relative heights between 0.1 and 0.3, indicating that the targets are predominantly small objects occupying only a small fraction of each image. This property not only makes detection more difficult but also places greater demands on the model's robustness in small-object detection.
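For readers reproducing the experimental setup, the following is a minimal sketch of how such an 8:1:1 split could be produced for a YOLO-style dataset; the directory layout, file extensions, and function name are illustrative assumptions rather than part of the STBD-08 release.

```python
# Hypothetical sketch of an 8:1:1 train/val/test split for YOLO-format data;
# paths and directory layout are assumptions, not taken from the paper.
import random
import shutil
from pathlib import Path

def split_dataset(image_dir: str, label_dir: str, out_dir: str, seed: int = 0):
    """Shuffle image/label pairs and copy them into train/val/test folders (8:1:1)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)

    n = len(images)
    splits = {
        "train": images[: int(0.8 * n)],
        "val": images[int(0.8 * n): int(0.9 * n)],
        "test": images[int(0.9 * n):],
    }
    for split, files in splits.items():
        img_out = Path(out_dir) / "images" / split
        lbl_out = Path(out_dir) / "labels" / split
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, img_out / img.name)
            label = Path(label_dir) / (img.stem + ".txt")  # YOLO-format label file
            if label.exists():
                shutil.copy(label, lbl_out / label.name)

# Example (hypothetical paths):
# split_dataset("STBD-08/images", "STBD-08/labels", "STBD-08_split")
```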

3.2. YOLOv11n

YOLOv11 is an open-source model for object detection, developed by the Ultralytics team. The newest installment in the YOLO series not only facilitates object identification but also encompasses various visual tasks, such as image classification, instance segmentation, and real-time tracking. Through extensive optimization of its network design and training methodology, YOLOv11 exhibits exceptional performance in intricate visual contexts. The official release includes five basic models: YOLOv11n, YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x, addressing various application requirements from lightweight deployment to high-precision detection. Given the computational limitations of edge devices in smart classroom environments and the strict demands for real-time detection and responsiveness, this paper identifies the most compact and computationally efficient YOLOv11n as the baseline model for performance comparisons and subsequent refinement studies in student behavior recognition tasks. The architecture of YOLOv11 consists of an input layer, a feature extraction layer (backbone), a feature fusion layer (neck), and an output prediction layer (head), which, respectively, manage image preprocessing, deep feature extraction, multi-scale information fusion, and final detection prediction, as depicted in Figure 3. The model comprises two essential structural modules: C3k2 and C2PSA. The C3k2 module enhances YOLOv8’s C2f architecture by allowing flexible transitions between lightweight Bottleneck structures and deep C3 structures using the C3k parameter, hence improving feature representation capabilities. The C2PSA module integrates position-sensitive attention, multi-head attention, and a feedforward network, significantly improving the model’s spatial awareness and detailed feature representation, thereby establishing a solid structural basis for future enhancements.

3.3. Proposed Method

The student behavior recognition model presented in this paper consists of an input layer, backbone network, neck network, detection head, and output layer. The design comprehensively addresses the intricacies of smart classroom environments and the specifications for small item detection. Classroom images are subjected to preprocessing before network entry, which encompasses Mosaic data augmentation, Mixup concatenation, adaptive anchor box computation, and adaptive grayscale filling. They are consistently scaled to conform to the network’s input dimensions. These tactics not only augment the diversity of training samples but also markedly strengthen the model’s resilience to prevalent disturbances in smart classrooms, including fluctuations in lighting, mutual occlusion among students, and the depiction of behavioral actions as diminutive targets within images. The preprocessed pictures initially undergo multi-level feature extraction through the backbone network, followed by multi-scale fusion within the neck network. This augments the model’s representational capability for intricate situations and nuanced behaviors. The detection head subsequently translates this integrated information into classification and localization tasks, facilitating the identification and detection of eight distinct student behaviors. The comprehensive framework is depicted in Figure 4.
This work introduces the PMSA module to tackle recognition challenges in smart classrooms caused by complex backgrounds, student occlusions, and small targets. It reconstructs and optimizes the backbone and neck components, allowing the network to selectively emphasize critical information during the feature extraction and fusion phases and thereby improving overall discriminative capacity. A Small Object Aware Head (SAH), designed to enhance the detection of small targets, is introduced to reduce model complexity while preserving detection accuracy; by eliminating redundant parameters it lowers computational cost and sustains good detection performance under limited computational resources. Additionally, a multi-head attention mechanism is applied before the detection head to improve the differentiation of visually similar student behavior categories; modeling feature relationships across multiple subspaces strengthens the detection of nuanced differences. The optimized network strikes a balance among precision, lightweight design, and robustness, providing a feasible and scalable approach for detecting student behavior in smart classroom environments.

3.3.1. CSP-PMSA Module

In intelligent classroom environments, the detection of student behavior is often hindered by intricate backgrounds, where substantial extraneous information obscures essential action characteristics, resulting in a loss of detail and reduced classification precision. This study optimizes the C3k2 module of YOLOv11, creating an innovative CSP-PMSA architecture. As depicted in Figure 5, this module modulates input features so that the network concentrates on essential information within specific behavioral spatial regions. This improves contextual acquisition and integration, efficiently preserving intricate features in complex backgrounds while augmenting feature representation capacity. Simultaneously, the Cross Stage Partial (CSP) architecture is incorporated into this module. The fundamental approach divides input features into two streams: one is processed through the deep convolutional branch to improve feature modeling, while the other is preserved as the cross stage branch, with both streams merged at the output. This inter-stage information transfer diminishes gradient redundancy, optimizes feature utilization efficiency, and improves the network's training convergence.
The PMSA module utilizes a blend of partial convolution and multi-scale convolution for feature improvement. The input features initially undergo a 3 × 3 convolution and are subsequently bifurcated: one segment proceeds through 5 × 5 and 7 × 7 convolutions to achieve an expanded receptive field and enhanced contextual information, while the other segment is preserved to avert the loss of fine-grained details due to excessive convolutional depth. This architecture accomplishes complementary modeling of multi-scale elements and intricate details while maintaining original information.
In the feature fusion phase, features from various sizes (shallow-level information at 3 × 3, meso-scale information at 5 × 5, and macro-scale information at 7 × 7) are concatenated along the channel dimension to create a holistic representation that harmonizes detail with contextual understanding. Subsequently, 1 × 1 convolutions enable channel interaction and feature compression, while residual connections augment input and output to improve feature flow and resilience. The CSP-PMSA module, utilizing “CSP cross stage feature fusion, partial convolution, multi-scale modeling, and residual connection,” significantly augments the model’s capacity to capture essential information in intricate classroom environments, thereby enhancing the precision and resilience of behavior detection.
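To make the data flow described above concrete, the following is a minimal PyTorch sketch of the CSP-PMSA idea; the channel widths, the even split ratio, the sequential arrangement of the 5 × 5 and 7 × 7 branches, and the omission of normalization and activation layers are all assumptions rather than the authors' exact implementation.

```python
# Minimal PyTorch sketch of the CSP-PMSA structure described in the text.
import torch
import torch.nn as nn

class PMSA(nn.Module):
    """Partial multi-scale aggregation: a 3x3 stem, 5x5/7x7 context branches on
    half of the channels, channel concatenation, 1x1 fusion, and a residual path."""
    def __init__(self, c: int):
        super().__init__()
        h = c // 2
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)   # shallow 3x3 features
        self.conv5 = nn.Conv2d(h, h, 5, padding=2)   # meso-scale context
        self.conv7 = nn.Conv2d(h, h, 7, padding=3)   # macro-scale context
        self.fuse = nn.Conv2d(h * 3, c, 1)           # 1x1 channel interaction/compression

    def forward(self, x):
        y = self.conv3(x)
        keep, deep = y.chunk(2, dim=1)   # "partial": only half the channels go deeper
        f5 = self.conv5(deep)
        f7 = self.conv7(f5)              # progressively enlarged receptive field
        out = self.fuse(torch.cat([keep, f5, f7], dim=1))
        return x + out                   # residual connection preserves original details


class CSP_PMSA(nn.Module):
    """Cross Stage Partial wrapper: one stream through PMSA, one identity stream,
    merged by a 1x1 convolution at the output."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        h = c_out // 2
        self.branch_a = nn.Conv2d(c_in, h, 1)   # stream processed by PMSA
        self.branch_b = nn.Conv2d(c_in, h, 1)   # cross stage (identity) stream
        self.block = PMSA(h)
        self.merge = nn.Conv2d(2 * h, c_out, 1)

    def forward(self, x):
        a = self.block(self.branch_a(x))
        b = self.branch_b(x)
        return self.merge(torch.cat([a, b], dim=1))


if __name__ == "__main__":
    feat = torch.randn(1, 64, 80, 80)
    print(CSP_PMSA(64, 64)(feat).shape)   # torch.Size([1, 64, 80, 80])
```

In this reading, the identity half of the split and the outer residual connection are what preserve fine-grained details, while only part of the channels pays the cost of the larger kernels.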

3.3.2. SAH Module

In intelligent classroom environments, students are abundant and dispersed over several seating sections, exhibiting considerable scale disparities between the front and rear rows. The actions of students in the rear row typically appear like minor targets. Concurrently, occlusions and intricate backgrounds significantly intensify the challenges associated with small item recognition. Conventional YOLO-based detectors generally utilize separate branches for the analysis of multi-scale properties, with each branch specifically tasked with identifying tiny, medium, and large objects, respectively. This approach accommodates many item scales but may neglect essential information in small object detection. Moreover, the absence of efficient inter-branch coordination results in inadequate feature utilization.
Given the multi-scale feature maps $F_s$, $F_m$, and $F_l$ corresponding to small, medium, and large objects, each feature map is first standardized through a 1 × 1 convolution:
$$\hat{F}_i = \mathrm{Conv}_{1\times 1}(F_i), \qquad i \in \{s, m, l\}.$$
Then, a shared 3 × 3 convolution is applied:
$$F'_i = \mathrm{Conv}_{3\times 3}(\hat{F}_i).$$
The scale layer introduces a learnable parameter $\alpha_i$ for each scale, initialized to 1.0 and optimized during training:
$$F''_i = \alpha_i \odot F'_i,$$
where ⊙ denotes element-wise multiplication. This allows adaptive weighting of different scales to highlight small object features.
Finally, the scaled feature maps are aggregated and passed into the regression branch:
$$F_{\mathrm{out}} = \mathrm{Concat}(F''_s, F''_m, F''_l).$$
This process explicitly models scale importance and enhances feature fusion. In this way, the scale layer adaptively adjusts feature responses, leading to more significant enhancement effects for small targets.
In contrast to conventional solo detection heads, the SAH module markedly improves the model’s sensitivity to diminutive objects and intricate features, all while preserving a manageable total processing burden (Figure 6). This design enhances the reliability of student behavior detection in intricate situations, such as rear seating and obstructions, while also making the model more versatile for practical use in smart classrooms.
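A minimal sketch of the scale-aware weighting described by the equations above is given below, assuming a unified channel width, a single shared 3 × 3 convolution, and per-scale scalar weights; since the three feature maps have different spatial resolutions, the final aggregation is left to the downstream classification and regression branches rather than a literal channel concatenation.

```python
# Minimal PyTorch sketch of the SAH scale layer; channel counts are assumptions.
import torch
import torch.nn as nn

class ScaleAwareHead(nn.Module):
    """Unify channels with 1x1 convs, apply one shared 3x3 conv, then reweight
    each scale with a learnable alpha_i (initialized to 1.0)."""
    def __init__(self, in_channels=(64, 128, 256), unified=64):
        super().__init__()
        self.unify = nn.ModuleList(nn.Conv2d(c, unified, 1) for c in in_channels)
        self.shared = nn.Conv2d(unified, unified, 3, padding=1)
        self.alpha = nn.Parameter(torch.ones(len(in_channels)))  # one weight per scale

    def forward(self, feats):
        # feats: [F_s, F_m, F_l], feature maps for small/medium/large objects
        out = []
        for i, f in enumerate(feats):
            f = self.shared(self.unify[i](f))    # F'_i = Conv3x3(Conv1x1(F_i))
            out.append(self.alpha[i] * f)        # F''_i = alpha_i (element-wise) F'_i
        return out  # handed to the per-scale classification/regression branches


if __name__ == "__main__":
    head = ScaleAwareHead()
    maps = [torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)]
    print([o.shape for o in head(maps)])
```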

3.3.3. MHSA Module

In intelligent classroom environments, the detection of student behavior is sometimes impeded by intricate backgrounds, analogous behavioral categories, and changes in scale. Although typical convolutional neural networks proficiently extract local data, they encounter difficulties in modeling global dependencies and capturing inter-regional behavioral variations, frequently resulting in category confusion or imprecise localization. This work presents a Multi-Head Self-Attention (MHSA) technique prior to the three detecting heads of YOLOv11. This improves the feature’s worldwide perception capacities and distinguishing expressiveness. Figure 7 demonstrates that the MHSA module accomplishes the simultaneous modeling of content and location information via parallelized multi-head attention processes. This allows the network to more precisely differentiate various student behaviors within intricate classroom settings.
During the implementation process, the input feature map $x \in \mathbb{R}^{H \times W \times d}$ is first processed by three 1 × 1 convolutions to produce the query (Q), key (K), and value (V) matrices:
$$q = x W_Q, \qquad k = x W_K, \qquad v = x W_V,$$
where $W_Q$, $W_K$, and $W_V$ are trainable parameter matrices. To improve the model's spatial awareness, a horizontal positional encoding $R_h \in \mathbb{R}^{H \times 1 \times d}$ and a vertical positional encoding $R_w \in \mathbb{R}^{1 \times W \times d}$ are incorporated. The aggregate positional bias is
$$r = R_h + R_w.$$
Subsequently, the query matrix q is augmented with the positional bias r and is subsequently multiplied by the key matrix k via dot-product computing to derive the attention weights:
$$\alpha = \mathrm{softmax}\!\left(\frac{(q + r)\,k^{\top}}{\sqrt{d_k}}\right).$$
The output feature representation is obtained as the weighted sum of the value matrix $v$ using the attention weights $\alpha$:
$$z = \alpha v.$$
Utilizing multi-head parallel processing, separate attention heads identify correlations among behavioral features across several subspaces, hence improving the network’s capacity to differentiate student behaviors in intricate classroom settings. In contrast to individual convolutional features, the MHSA module markedly enhances discrimination performance for visually analogous actions (e.g., “raising hands” and “writing”), while exhibiting superior robustness in situations characterized by weak boundaries, occlusions, and multi-scale behavioral contexts.
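The following is a minimal PyTorch sketch of a multi-head self-attention block with the 2D positional bias formulated above; the head count, the initialization of the positional encodings, and the fixed input resolution are assumptions, not the authors' exact configuration.

```python
# Minimal sketch of MHSA over a feature map with additive 2D positional bias,
# alpha = softmax((q + r) k^T / sqrt(d_k)), z = alpha v.
import torch
import torch.nn as nn

class MHSA2D(nn.Module):
    def __init__(self, dim: int, height: int, width: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # learnable horizontal / vertical positional encodings R_h and R_w
        self.rel_h = nn.Parameter(torch.randn(1, dim, height, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(1, dim, 1, width) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        d = c // self.heads

        def split_heads(t):
            # (B, C, H, W) -> (B, heads, H*W, d)
            return t.reshape(b, self.heads, d, h * w).transpose(2, 3)

        q = split_heads(self.q(x))
        k = split_heads(self.k(x))
        v = split_heads(self.v(x))
        r = split_heads((self.rel_h + self.rel_w).expand(b, c, h, w))  # r = R_h + R_w

        attn = torch.softmax((q + r) @ k.transpose(-2, -1) * self.scale, dim=-1)
        z = attn @ v
        return z.transpose(2, 3).reshape(b, c, h, w)


if __name__ == "__main__":
    block = MHSA2D(dim=128, height=20, width=20)
    print(block(torch.randn(2, 128, 20, 20)).shape)   # torch.Size([2, 128, 20, 20])
```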

4. Experimental Results and Analysis

4.1. Experimental Settings and Model Training

Table 1 summarizes the experimental environment used in this investigation, which ran Ubuntu 22.04.3 LTS. The main hardware platform consisted of an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM) and an x86_64 CPU with 48 GB of memory. The software environment was built on Torch 2.3.1, CUDA 12.8, and Python 3.10.12 to guarantee efficient deep learning model training and inference. With respect to the model training parameters, described in Table 2, training used 50 epochs with a batch size of 16 and a fixed input image size of 640 × 640. SGD was chosen as the optimizer, with an initial learning rate of 0.01 and a weight decay coefficient of 0.0005 to mitigate overfitting. These settings improve detection performance in complex classroom scenes while balancing training stability and convergence efficiency.
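For illustration, the training configuration in Tables 1 and 2 could be reproduced with the Ultralytics training API roughly as follows; the dataset YAML path is a placeholder assumption, and the call is a sketch rather than the authors' actual training script.

```python
# Hedged sketch of the training setup in Tables 1 and 2 (Ultralytics API);
# "stbd08.yaml" is a hypothetical dataset config file for the 8 behavior classes.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")       # YOLOv11n baseline weights
model.train(
    data="stbd08.yaml",          # hypothetical path to the STBD-08 dataset config
    epochs=50,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                    # initial learning rate
    weight_decay=0.0005,
)
```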

4.2. Evaluation Indicators

This article utilizes precision (P), recall (R), average precision (AP), and mean average precision (mAP) as the principal metrics to thoroughly assess the model’s efficacy in identifying student behavior in smart classrooms. Precision quantifies the accuracy of predictions, defined as the ratio of samples accurately identified as positive to the total samples anticipated as positive. The formula is as follows:
$$P = \frac{TP}{TP + FP}.$$
TP signifies true positive, and FP represents false positive. Recall measures the model’s capacity to identify instances of the positive class, defined as
$$R = \frac{TP}{TP + FN}.$$
Among these, FN denotes false negatives. AP represents the average precision of a single class across different confidence thresholds, calculated as follows:
$$AP_i = \int_{0}^{1} p(r)\,dr, \qquad i = 1, 2, \ldots, N,$$
where $N$ denotes the number of classes, $AP_i$ represents the average precision of the $i$-th class, and $p(r)$ denotes the precision–recall curve. The mean average precision (mAP) is defined as the average of the $AP$ values over all classes:
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i.$$
mAP@50 signifies the mean average precision at an intersection-over-union (IoU) threshold of 0.5, while mAP@50:95 is computed over an IoU spectrum from 0.5 to 0.95 in 0.05 increments, indicating a more stringent and thorough evaluation. This study utilizes GFLOPs and parameters to represent the model’s performance in terms of computational overhead and parameter scaling.
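As a small worked illustration of these definitions, the snippet below computes precision, recall, AP (by numerically integrating a precision–recall curve), and mAP; the per-class AP values in the example are hypothetical and not taken from the paper's results.

```python
# Illustrative implementations of P, R, AP, and mAP as defined above.
import numpy as np

def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN)"""
    return tp / (tp + fn)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP_i = integral of p(r) dr, approximated with the trapezoidal rule."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(ap_per_class) -> float:
    """mAP = (1/N) * sum of AP_i over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class AP@50 values for eight behavior classes (illustration only)
print(mean_average_precision([0.95, 0.86, 0.96, 0.95, 0.92, 0.91, 0.71, 0.78]))
```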

4.3. Attention Contrast Experiment

This study included attention mechanisms, such as EMA [29], SimAM [30], LSKA [31], CAA [32], and MHSA [33], prior to each detection head of YOLOv11n in classroom-based student behavior detection tests to assess their effect on model performance. It is important to note that the YOLOv11n baseline model (without any attention modules) serves as the reference for comparison in this experiment, corresponding to the first row in Table 3. Table 3 displays the findings of the experiment. With the mAP@50 falling to 88.8% and the mAP@50–95 to 72.9%, respectively, EMA showed a modest decline in accuracy but maintained steady parameter counts and computational complexity when compared to the baseline model (88.9% mAP@50, 73.1% mAP@50–95). This suggests that fine-grained characteristics were somewhat degraded by exponential smoothing. With the mAP@50 and mAP@50–95 dropping to 88.5% and 71.7%, respectively, SimAM’s overall performance was worse, suggesting a limited ability to represent small targets and fine-grained actions in complicated classroom situations.
With an mAP@50 of 89.5% and an mAP@50–95 of 73.9% while keeping essentially equal computational and parameter scales, LSKA outperformed both EMA and SimAM, demonstrating superior cross-scale modeling capabilities. Although CAA’s performance varied slightly (parameter count increased to 2.7M and GFLOPs increased to 6.8), it produced an mAP@50 and mAP@50–95 of 88.6% and 72.1%, respectively, with no appreciable improvement over baseline models. The most exceptional performance was shown by MHSA, which achieved an mAP@50 of 90.3% and an mAP@50–95 of 74.6%. It performed better than alternative mechanisms in complicated classroom situations in terms of detecting visually similar behaviors and overall detection accuracy, even with a minor increase in parameters (2.8M) and processing cost (8.0G). All things considered, EMA and SimAM show a stronger propensity for feature stability but only modest increases in detection accuracy. LSKA effectively enhances cross-scale modeling capabilities without increasing computational overhead. CAA shows relatively slight gains in performance. However, when it comes to accuracy and robustness, MHSA performs the best, making it the most promising attention mechanism for detecting student behavior in classroom situations. It is worth noting that several categories in the dataset exhibit a high degree of visual similarity, making them particularly challenging to distinguish in complex classroom scenarios. For example, “reading” and “listening” often involve similar sitting postures with subtle differences in eye and head orientation, while “raising hands” and “discussing” share similar upper body movements and partial occlusions. Traditional attention mechanisms tend to focus on local regions, making it difficult to disambiguate these fine-grained differences. In contrast, MHSA leverages global dependency modeling and positional bias encoding to enhance inter-class discriminability, which contributes significantly to its superior performance observed in Table 3.

4.4. Ablation Studies

This study carried out systematic ablation tests to thoroughly assess the contribution of the suggested enhancement modules to the baseline model YOLOv11n; the findings are shown in Table 4. The trials show that introducing each module separately results in notable performance boosts, but integrating numerous modules together produces the best results, showing the modules’ complementary nature and synergistic benefits.
The PMSA module, in particular, improved mAP@50 and mAP@50–95 to 90.1% and 74.3%, respectively, and raised model precision (P) from 85.6% to 87.5% and recall (R) from 84.0% to 85.0%. This suggests that PMSA improves contextual feature modeling and effectively reduces the feature loss caused by interference and occlusions in intricate classroom environments. The SAH (Small Object Aware Head) architecture, in turn, significantly simplifies the model while preserving detection accuracy: the parameter count dropped from 2.6 M to 2.4 M and the computational cost from 6.4 GFLOPs to 6.3 GFLOPs, while performance remained strong (mAP@50 at 89.9%, mAP@50–95 at 73.6%). This illustrates how SAH retains its lightweight advantages while improving small object detection and fine-grained behavior recognition, which benefits edge computing deployment.
The model's capacity to model global dependencies was significantly improved once MHSA modules were added before each detection head; mAP@50 and mAP@50–95 improved to 90.3% and 74.6%, respectively. This enhancement confirmed MHSA's benefit for global modeling and cross-category discrimination by markedly improving the model's capacity to identify visually similar categories such as "raising hands" and "discussing," in addition to increasing overall detection accuracy. Finally, the integrated model, which combines PMSA, SAH (Small Object Aware Head), and MHSA, performed best on all metrics: mAP@50 and mAP@50–95 reached 91.5% and 75.5%, respectively, while precision (P) rose to 88.2% and recall (R) reached 85.6%. With only a moderate increase in parameter count and computational complexity (from 2.6 M to 2.7 M parameters, and from 6.4 to 9.2 GFLOPs), this model outperforms the baseline YOLOv11n by 2.6% in precision, 1.6% in recall, 2.6% in mAP@50, and 2.4% in mAP@50–95.
In conclusion, PMSA, SAH, and MHSA each exhibit distinct advantages in global dependency modeling, lightweight detection, and context modeling. Their combined use intensifies these individual benefits even more, producing a synergistic effect that greatly improves the model’s overall accuracy and efficiency. This result not only confirms that the suggested module design makes sense, but it also offers strong backing for the model’s use in large-scale, real-time behavior detection situations like smart classrooms.

4.5. Performance Comparison

Student behavior recognition in the complex practical setting of smart classrooms requires not only high detection accuracy but also careful attention to model parameter scale and computational complexity, so as to ensure real-time deployment feasibility in resource-constrained teaching environments. To this end, the proposed approach is systematically compared with other mainstream detection models under the same experimental setup and datasets. Table 5 displays the outcomes of the experiment.
Overall, conventional detection methods such as SSD and Faster R-CNN meet the detection requirements of certain static scenarios, with mAP@50 scores of 81.5% and 84.2%, respectively. However, their large parameter sizes (24.5 M and 41.3 M) and high computational complexity (31.5 GFLOPs and 189 GFLOPs) make them ill-suited to real-time applications in smart classrooms. The YOLO series algorithms, by contrast, exhibit a better balance between computational efficiency and detection accuracy owing to their lightweight design. YOLOv11, the most recent baseline model, reaches 88.9% mAP@50 and 73.1% mAP@50–95, the best detection performance among the compared baselines, although its consistency in complex action recognition is still lacking. With its self-attention mechanism, DETR exhibits strong global modeling capabilities, but its computational demand of 58.6 GFLOPs and large parameter count of 20.1 million lead to deployment costs far greater than those of the YOLO series, hampering large-scale adoption in classroom teaching scenarios.
Conversely, the suggested enhanced model exhibits all-encompassing performance benefits, including an mAP@50 of 91.6%, recall (R) of 85.6%, precision (P) of 88.2%, and an mAP@50–95 of 75.7%. Our method delivers a 2.6% increase in mAP@50-95 and a 3.0% improvement in accuracy when compared to the YOLOv11 baseline, with only slight increases in processing cost and parameter count. Additionally, our method reduces the number of parameters by about 86.6% and the computational complexity by about 84.3%, while improving accuracy by 4.6% and 3.6%, respectively, when compared to DETR. This preserves excellent precision while drastically reducing resource use. These outcomes show that our method is the best option for large-scale, low-latency intelligent classroom behavior detection since it not only achieves accuracy breakthroughs but also demonstrates exceptional benefits in complexity management.
These performance improvements can be attributed to the joint contribution of three modules integrated into the network. CSP-PMSA enhances the representation of salient features and contextual cues in cluttered classroom scenes, effectively mitigating background interference and occlusion effects. SAH unifies multi-scale channels and adaptively adjusts their weights, significantly improving the detection of small and fine-grained behaviors in the rear seating areas. MHSA, introduced before each detection head, captures long-range dependencies and positional bias across spatial subspaces, enhancing global perception and improving the distinction between visually similar categories. As shown in Table 6, the proposed method improves AP@50 for “reading” and “listening” by 4.5% and 5.5%, respectively, relative to YOLOv11, and achieves 3.5% and 4.0% gains for “raising hands” and “discussing”. These improvements confirm that MHSA effectively captures subtle differences in posture, gestures, and orientation, thereby reducing confusion between similar categories and boosting fine-grained behavior recognition.
There are noticeable variations in detection performance among the eight student behavior classes at the category level (refer to Table 6 and Figure 8). For static behaviors like “writing” and “reading,” SSD and Faster R-CNN exhibit respectable accuracy; however, for small object or dynamic categories like “turning around” and “guiding,” their precision significantly decreases. Overall performance gains are shown by YOLOv3-Tiny, YOLOv6, and YOLOX; however, there is still difficulty differentiating between comparable categories, such as “raising hand” and “discussing.” While DETR excels in large-motion categories like “standing,” it demonstrates significant variability in small object identification. YOLOv11, on the other hand, has a fairly balanced performance but lacks stability for several fine-grained actions. Conversely, the suggested model consistently demonstrates good accuracy in all eight behavioral categories, with a notable emphasis on fine-grained action identification, such as “raising hands” and “discussing.” In challenging classroom contexts, the mAP@50 curve continuously outperforms other models, exhibiting higher flexibility and discriminative capability.
Notably, the suggested model shows better consistency and generalization capacities across behavioral categories in addition to striking a balance between detection performance and efficiency. This makes it possible to better adapt to a variety of classroom environments, better identify intricate interactive behaviors, and enable dynamic evaluation of student involvement, attentiveness, and classroom interaction patterns in educational settings. In conclusion, the suggested approach shows great application potential and usefulness by achieving high accuracy, low computational complexity, and strong robustness in smart classroom behavior detection tasks.

4.6. Model Performance Analysis of Deep Learning for Student Behavior Detection in Smart Classroom Environments

The YOLOv11n model’s F1 confidence curve is shown in Figure 9. The findings show that while the model performs consistently overall across the majority of behavioral categories, it still has issues interpreting small targets and fine-grained behaviors. For example, YOLOv11n’s poor F1 ratings for the “turning around” and “guiding” actions show that it is not sensitive enough to local features and occlusions in complicated classroom environments. Additionally, the model’s practical applicability is limited due to its high rates of false positives and false negatives for dynamic actions and visually similar categories, even while it achieves adequate recognition accuracy for typical static behaviors (such as “writing” and “reading”).
On the other hand, the F1 confidence curve for the suggested enhanced model is shown in Figure 10. Greater stability and smoothness in the overall curve show that the model retains good precision and recall across a range of confidence criteria. Notably, the updated model shows noticeably higher F1 scores for categories like “writing” and “raising hand,” which strengthens the ability to recognize small objects and enhances the recognition of fine-grained behaviors. More significantly, the enhanced model outperforms YOLOv11n in terms of discriminating in readily confused comparable categories (e.g., “raising hand” versus “discussing”). This result confirms that the CSP-PMSA, SAH, and MHSA modules that were introduced have complementary benefits when it comes to modeling features at various hierarchical levels.
Figure 11 and Figure 12 use normalized confusion matrices to better show the differences between before and after improvement. An overall improvement in classification accuracy is indicated by the enhanced model’s general increase in diagonal components across behavioral categories. The most notable improvements were seen in actions that were prone to occlusion or small scale: turning around saw a diagonal value increase from 0.61 to 0.71 (+0.10), raising hand from 0.69 to 0.78 (+0.09), standing from 0.93 to 0.95 (+0.02), discussing from 0.89 to 0.92 (+0.03), and guiding from 0.87 to 0.91 (+0.04). Additionally, categories that were visually stable or static continued to increase steadily: reading improved from 0.85 to 0.86 (+0.01), writing was steady at 0.95, and listening climbed from 0.95 to 0.96 (+0.01). This unique enhancement closely matches the three modules’ functionalities: by modeling global dependencies and positional bias across multiple subspaces, MHSA improves discrimination between similar categories (e.g., discussing and standing); PMSA suppresses background interference in context modeling, guaranteeing high consistency for categories like listening and standing in complex scenarios; and SAH efficiently improves small-scale and local-scale features, greatly improving detection of turning around and raising hand.
Interestingly, the model's improvements are further validated by reductions in common misclassification pairs. The misclassification rate for reading–guiding dropped from 0.35 to 0.34 (−0.01), listening–turning around from 0.21 to 0.18 (−0.03), and listening–guiding from 0.33 to 0.32 (−0.01). Furthermore, false positives caused by the background were markedly reduced, with the bottom "background" row showing lower interference across categories. For example, the guiding column decreased from 0.11 to 0.09 (−0.02), the turning around column from 0.06 to 0.05 (−0.01), the reading column from 0.03 to 0.02 (−0.01), and the standing column from 0.05 to 0.04 (−0.01). SAH's role in scale alignment (reducing misclassification of small targets into adjacent categories), PMSA's advantage in context filtering (effectively reducing false triggers in complex backgrounds), and MHSA's global consistency modeling (reducing confusion between semantically similar categories) are all reflected in these changes.
The classroom detection results of the YOLOv11 baseline model are shown in Figure 13. For common static behaviors such as "reading" and "listening," this model reliably generates stable bounding boxes. It struggles, however, with small actions and occlusions, which can result in missed detections or inconsistent labeling. A "writing" activity, for example, goes undetected in Figure 13, demonstrating the model's inability to reliably identify fine-grained behaviors in busy classroom settings.
On the other hand, the detection outcomes from our suggested model are displayed in Figure 14. With more accurate bounding boxes and higher confidence scores, this improved model performs noticeably better. Partial occlusions are handled by the model with ease, and it correctly detects delicate actions like “writing.” Additionally, Figure 14 shows that our model produces a cleaner and more dependable detection output than the baseline since it shows fewer duplicate boxes and misclassifications. In conclusion, the enhanced model outperforms the YOLOv11 baseline model in real-world classroom scenarios, as seen in Figure 14, which also shows higher performance in handling occlusions and recognizing complicated behaviors.

4.7. System Processes and Applications

This study’s methodical workflow, depicted in Figure 15, includes the complete process from model training to edge deployment. Model training and optimization are initially performed on the server side via the PyTorch 2.3.1 framework. The trained model is subsequently exported in the Open Neural Network Exchange (ONNX) format. ONNX, being an open intermediate representation format, facilitates efficient model transfer and inference deployment across many deep learning frameworks and hardware platforms, hence ensuring model compatibility and scalability in practical applications.
The exported ONNX model is implemented on the Jetson Orin Nano Super platform. This architecture, engineered by NVIDIA for edge AI applications, incorporates a 1024-core CUDA GPU that provides up to 32 TOPS of AI processing capability. The low power consumption and high integration facilitate deep learning inference in resource-limited settings. This embedded device captures classroom video streams in real-time using cameras. Subsequent to preprocessing, the data is sent into the model for inference. The detection results are presented on monitoring terminals and are simultaneously transmitted to the teacher management system for instructional behavior analysis and decision-making support.
In practical deployment, the system achieves a stable frame rate of approximately 23 FPS under the standard PyTorch runtime. After applying TensorRT acceleration, the inference speed increases to 56 FPS with an average power consumption ranging between 7 W and 15 W, which satisfies real-time requirements in typical classroom environments while maintaining energy efficiency. Latency measurements indicate smooth operation with no significant delays in detection and feedback, confirming the feasibility of real-time deployment on embedded hardware.
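As a rough illustration of this export path, the trained model could be converted to ONNX with the Ultralytics export API and then built into a TensorRT engine on the Jetson device; the file names below are placeholders, and the exact conversion flags used by the authors are not specified in the paper.

```python
# Hedged sketch of the ONNX export step; "best.pt" is a placeholder for the
# trained student-behavior weights.
from ultralytics import YOLO

model = YOLO("best.pt")
model.export(format="onnx", imgsz=640)   # writes best.onnx for cross-framework deployment

# On the Jetson Orin Nano, the ONNX file can then be built into a TensorRT engine,
# e.g. with the trtexec tool (FP16 shown here as an assumed precision choice):
#   trtexec --onnx=best.onnx --saveEngine=best.engine --fp16
```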
Beyond technical metrics, the system’s integration into real classroom environments highlights its practical relevance. It allows educators to monitor classroom engagement, analyze behavioral trends, and provide timely feedback without excessive computational or energy costs. Nevertheless, ethical considerations such as privacy protection and responsible use of student monitoring technologies are essential. In future applications, ensuring compliance with data protection regulations and implementing privacy-preserving strategies (e.g., anonymization, on-device processing) will be critical to maintaining ethical standards in educational settings.
The experiments indicate that the system functions reliably on the Jetson Orin Nano Super platform while predominantly satisfying real-time needs. The findings confirm the suggested method’s viability in real classroom environments and demonstrate that the integration of advanced deep learning models with edge computing hardware offers a feasible technical approach for smart classroom advancement.

4.8. Discussion

This study demonstrates that a task-specific architectural design tailored to classroom environments can significantly enhance the accuracy and robustness of student behavior detection. By integrating contextual enhancement, scale-aware perception, and global dependency modeling, the proposed approach achieves stable performance in the presence of background clutter, occlusions, and inter-class similarity, while maintaining computational efficiency suitable for real-time deployment in educational scenarios. These results suggest that lightweight structural optimization is a promising direction for practical classroom applications, enabling efficient behavioral analysis without compromising detection precision.
Despite these promising outcomes, several limitations remain. The STBD-08 dataset used in this study originates from a single region, which may limit the model’s generalization across different cultural and environmental settings. Additionally, class imbalance, such as the dominance of the “listening” category, may affect the detection of less represented behaviors. Future work will focus on expanding data diversity across regions and environments, incorporating reweighting strategies, advanced augmentation, or semi-supervised learning to address imbalance issues. Further attention will be given to privacy and ethical considerations related to student monitoring to ensure secure and responsible deployment in real educational settings.

5. Conclusions

Identifying and analyzing student classroom behavior is essential for enhancing instructional feedback and improving learning outcomes. This work proposes an enhanced YOLOv11-based detection framework that incorporates CSP-PMSA, SAH, and MHSA modules to effectively address background interference, occlusions, and inter-class similarity. The experimental results demonstrate that the proposed model achieves superior accuracy, recall, and mAP compared to mainstream methods, offering a practical and efficient solution for real-time multi-class student behavior detection in smart classrooms.
Future research will emphasize multimodal information integration, combining visual, auditory, and textual data to better understand student engagement and emotional states. In parallel, model compression, quantization, and edge deployment will be explored to further enhance efficiency in resource-constrained environments. Collaboration with educators will be pursued to refine behavioral categories and evaluation protocols, ensuring that the model remains scalable, interpretable, and applicable to real classroom scenarios.

Author Contributions

Conceptualization, J.W.; Methodology, J.W. and S.T.; Software, S.T.; Validation, J.W. and Y.S.; Formal Analysis, S.T.; Investigation, J.W.; Resources, J.W.; Data Curation, S.T.; Writing—Original Draft Preparation, J.W. and S.T.; Writing—Review and Editing, J.W., S.T. and Y.S.; Visualization, J.W., S.T. and Y.S.; Supervision, J.W.; Project Administration, J.W.; Funding Acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Program of the Hubei Provincial Department of Education (grant number B2023417) and the Hubei Province First-Class Undergraduate Course Construction Project.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Wuhan Qingchuan University (24 October 2025).

Informed Consent Statement

This study involves the analysis of anonymized or de-identified classroom behavioral data and does not include any medical intervention or activities that may infringe on participants’ rights or privacy. All procedures comply with the principles outlined in the Declaration of Helsinki (1975, revised 2013) and the relevant institutional and national ethical standards.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, Y.; Chen, L.; He, W.; Sun, D.; Salas-Pilco, S.Z. Artificial intelligence for enhancing special education for K-12: A decade of trends, themes, and global insights (2013–2023). Int. J. Artif. Intell. Educ. 2024, 35, 1129–1177. [Google Scholar] [CrossRef]
  2. Hu, J.; Huang, Z.; Li, J.; Xu, L.; Zou, Y. Real-time classroom behavior analysis for enhanced engineering education: An AI-assisted approach. Int. J. Comput. Intell. Syst. 2024, 17, 167. [Google Scholar] [CrossRef]
  3. Liu, Q.; Jiang, X.; Jiang, R. Classroom behavior recognition using computer vision: A systematic review. Sensors 2025, 25, 373. [Google Scholar] [CrossRef]
  4. Zhao, X.-M.; Yusop, F.D.B.; Liu, H.-C.; Prilanita, Y.-N.; Chang, Y.-X. Classroom student behavior recognition using an intelligent sensing framework. IEEE Access 2025, 13, 49767–49776. [Google Scholar] [CrossRef]
  5. Liu, S.; Zhang, J.; Su, W. An improved method of identifying learner’s behaviors based on deep learning. J. Supercomput. 2022, 78, 12861–12872. [Google Scholar] [CrossRef]
  6. Zheng, L.; Wang, C.; Chen, X.; Song, Y.; Meng, Z.; Zhang, R. Evolutionary machine learning builds smart education big data platform: Data-driven higher education. Appl. Soft Comput. 2023, 136, 110114. [Google Scholar] [CrossRef]
  7. Li, G.; Liu, F.; Wang, Y.; Guo, Y.; Xiao, L.; Zhu, L. A convolutional neural network (CNN) based approach for the recognition and evaluation of classroom teaching behavior. Sci. Program. 2021, 2021, 6336773.
  8. Feng, C.; Luo, Z.; Kong, D.; Ding, Y.; Liu, J. IMRMB-Net: A lightweight student behavior recognition model for complex classroom scenarios. PLoS ONE 2025, 20, e0318817.
  9. Han, L.; Ma, X.; Dai, M.; Bai, L. A WAD-YOLOv8-based method for classroom student behavior detection. Sci. Rep. 2025, 15, 9655.
  10. Chen, H.; Zhou, G.; Jiang, H. Student behavior detection in the classroom based on improved YOLOv8. Sensors 2023, 23, 8385.
  11. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  12. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  16. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  17. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  18. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  19. Amatari, V.O. The instructional process: A review of Flanders’ interaction analysis in a classroom setting. Int. J. Second. Educ. 2015, 3, 43–49.
  20. Wang, D.; Han, H.; Liu, H. Analysis of instructional interaction behaviors based on OOTIAS in smart learning environment. In Proceedings of the 2019 Eighth International Conference on Educational Innovation through Technology (EITT), Biloxi, MS, USA, 6–8 December 2019; pp. 147–152.
  21. Sarabu, A.; Santra, A.K. Human action recognition in videos using convolution long short-term memory network with spatio-temporal networks. Emerg. Sci. J. 2021, 5, 25–33.
  22. Pang, C.; Lu, X.; Lyu, L. Skeleton-based action recognition through contrasting two-stream spatial-temporal networks. IEEE Trans. Multimed. 2023, 25, 8699–8711.
  23. Varshney, N.; Bakariya, B. Deep convolutional neural model for human activities recognition in a sequence of video by combining multiple CNN streams. Multimed. Tools Appl. 2022, 81, 42117–42129.
  24. Wang, Z.; Yao, J.; Zeng, C.; Li, L.; Tan, C. Students’ classroom behavior detection system incorporating deformable DETR with Swin transformer and light-weight feature pyramid network. Systems 2023, 11, 372.
  25. Peng, S.; Zhang, X.; Zhou, L.; Wang, P. YOLO-CBD: Classroom behavior detection method based on behavior feature extraction and aggregation. Sensors 2025, 25, 3073.
  26. Zhu, W.; Yang, Z. CSB-YOLO: A rapid and efficient real-time algorithm for classroom student behavior detection. J. Real-Time Image Process. 2024, 21, 140.
  27. Chen, H.; Guan, J. Teacher–student behavior recognition in classroom teaching based on improved YOLO-v4 and Internet of Things technology. Electronics 2022, 11, 3998.
  28. Zhang, Y.; Wu, Z.; Chen, X.; Dai, L.; Li, Z.; Zong, X.; Liu, T. Classroom behavior recognition based on improved YOLOv3. In Proceedings of the 2020 International Conference on Artificial Intelligence and Education (ICAIE), Hangzhou, China, 26–28 June 2020; pp. 93–97.
  29. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  30. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 11863–11874.
  31. Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in CNN. Expert Syst. Appl. 2024, 236, 121352.
  32. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27706–27716.
  33. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 16519–16529.
Figure 1. Distribution of labels across rows in the dataset.
Figure 2. Visualization of entire dataset.
Figure 3. Network architecture of YOLOv11 baseline model.
Figure 4. Network architecture of proposed algorithm.
Figure 5. CSP-PMSA module architecture.
Figure 6. Schematic diagram of SAH module structure.
Figure 7. Schematic diagram of MHSA module structure.
Figure 8. Detection comparison results across eight student behavior categories for each model.
Figure 9. F1 confidence curve of YOLOv11n baseline model.
Figure 10. F1 confidence curve of the improved model.
Figure 11. Normalized confusion matrix of YOLOv11n baseline model.
Figure 12. Normalized confusion matrix of improved model.
Figure 13. Classroom detection results with YOLOv11n baseline model.
Figure 14. Classroom detection results with improved model.
Figure 15. Comprehensive procedure of smart classroom student behavior detection system.
Table 1. Experimental environment configuration.
Configuration | Parameter
CPU | x86_64
GPU | NVIDIA GeForce RTX 3090 (24 GB)
RAM | 48 GB
Operating System | Ubuntu 22.04.3 LTS
Python | 3.10.12
CUDA | 12.8
Torch | 2.3.1
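The software stack in Table 1 can be confirmed at runtime before training. The snippet below is a minimal sketch (not from the paper) that prints the same items using only the Python standard library and PyTorch; the versions in the comments reflect the paper's reported setup.

```python
# Minimal environment check mirroring Table 1 (illustrative sketch, not the authors' code).
import platform

import torch

print("Arch:", platform.machine())              # expected: x86_64
print("OS:", platform.platform())               # expected: Ubuntu 22.04.3 LTS
print("Python:", platform.python_version())     # expected: 3.10.12
print("Torch:", torch.__version__)              # expected: 2.3.1
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # expected: NVIDIA GeForce RTX 3090
    props = torch.cuda.get_device_properties(0)
    print("VRAM (GB):", round(props.total_memory / 1024**3))
```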
Table 2. Model training hyperparameter settings.
Parameter | Setting
Epochs | 50
Image Size | 640
Batch Size | 16
Optimizer | SGD
Learning Rate | 0.01
Weight Decay | 0.0005
Warmup Epochs | 3
Momentum | 0.937
Scheduler | Cosine
Loss Function | CIoU + BCE
Regularization | L2 + Label Smoothing
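The paper does not publish its training script; as an illustration of how the Table 2 settings map onto a concrete configuration, the sketch below assumes the Ultralytics YOLO training interface. The dataset YAML name is a placeholder, L2 regularization corresponds to the weight_decay term, and label smoothing would be enabled through the library's matching option if used.

```python
# Sketch of a training run with the Table 2 settings via the Ultralytics API
# (assumed interface; the dataset YAML path is a placeholder, not from the paper).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # YOLOv11n baseline weights
model.train(
    data="classroom_behaviors.yaml",  # hypothetical dataset config (8 behavior classes)
    epochs=50,
    imgsz=640,
    batch=16,
    optimizer="SGD",
    lr0=0.01,             # initial learning rate
    weight_decay=0.0005,  # L2 regularization term from Table 2
    warmup_epochs=3,
    momentum=0.937,
    cos_lr=True,          # cosine learning-rate schedule
)
```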
Table 3. Comparison of attention mechanisms.
Method | mAP@50 (%) | mAP@50–95 (%) | Parameters (M) | GFLOPs (G)
YOLOv11n | 88.9 | 73.1 | 2.6 | 6.4
+EMA | 88.8 | 72.9 | 2.6 | 6.6
+SimAM | 88.5 | 71.7 | 2.6 | 6.4
+LSKA | 89.5 | 73.9 | 2.6 | 6.6
+CAA | 88.6 | 72.1 | 2.7 | 6.8
+MHSA | 90.3 | 74.6 | 2.8 | 8.0
Table 4. Comparison of ablation experiment results.
Model | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%) | Param (M) | GFLOPs (G)
YOLOv11n | 85.6 | 84.0 | 88.9 | 73.1 | 2.6 | 6.4
YOLOv11n + PMSA | 87.5 | 85.0 | 90.1 | 74.3 | 2.6 | 7.8
YOLOv11n + SAH | 86.1 | 84.4 | 89.9 | 73.6 | 2.4 | 6.3
YOLOv11n + MHSA | 86.0 | 84.4 | 90.3 | 74.6 | 2.8 | 8.0
Ours | 88.2 | 85.6 | 91.5 | 75.5 | 2.7 | 9.2
Table 5. Comparison of detection results across different models.
Model | P (%) | R (%) | mAP@50 (%) | mAP@50–95 (%) | Param (M) | GFLOPs (G)
SSD | 79.5 | 69.5 | 81.5 | 68.1 | 24.5 | 31.5
Faster-RCNN | 81.8 | 71.5 | 84.2 | 70.3 | 41.3 | 189
YOLOv3-Tiny | 82.5 | 70.5 | 86.1 | 71.9 | 11.6 | 18.9
YOLOv6 | 84.0 | 72.5 | 87.0 | 71.6 | 4.0 | 11.8
YOLOX | 83.8 | 72.8 | 86.4 | 71.5 | 9.0 | 26.8
YOLOv11 | 85.6 | 84.0 | 88.9 | 73.1 | 2.6 | 6.4
RT-DETR | 84.5 | 72.5 | 87.0 | 72.1 | 20.1 | 58.6
Ours | 88.2 | 85.6 | 91.6 | 75.7 | 2.7 | 9.2
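For reference, overall mAP@50 and mAP@50–95 figures such as those in Tables 3–5 can be collected with a validation pass over the same split. The sketch below assumes the Ultralytics validation API; the weight files and dataset YAML are hypothetical placeholders, not artifacts released with the paper.

```python
# Sketch of gathering mAP@50 / mAP@50-95 for model comparison
# (assumed Ultralytics API; all file paths are placeholders).
from ultralytics import YOLO

weights = {
    "YOLOv11n": "yolo11n_classroom.pt",  # hypothetical fine-tuned baseline
    "Ours": "ours_classroom.pt",         # hypothetical improved model
}

for name, path in weights.items():
    model = YOLO(path)
    metrics = model.val(data="classroom_behaviors.yaml", imgsz=640, batch=16)
    # metrics.box.map50 and metrics.box.map are the dataset-level mAP@50 and mAP@50-95;
    # metrics.box.maps exposes per-class values for tables like Table 6.
    print(f"{name}: mAP@50={metrics.box.map50:.3f}, mAP@50-95={metrics.box.map:.3f}")
```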
Table 6. Per-class AP@50 across eight student behavior categories for different models.
Model | Writ. | Read. | List. | TA | Raise. | Stand. | Disc. | Guid.
SSD | 85.0 | 83.0 | 84.0 | 76.0 | 78.0 | 84.0 | 82.0 | 80.0
Faster-RCNN | 87.0 | 86.0 | 87.0 | 83.0 | 82.0 | 87.0 | 86.0 | 75.6
YOLOv3-Tiny | 88.5 | 86.5 | 88.0 | 79.5 | 81.0 | 89.0 | 87.0 | 89.3
YOLOv6 | 91.0 | 91.0 | 92.0 | 79.0 | 81.0 | 89.0 | 87.0 | 86.0
YOLOX | 89.0 | 87.0 | 88.0 | 78.0 | 80.0 | 91.0 | 90.0 | 88.2
YOLOv11 | 90.5 | 89.5 | 91.0 | 82.5 | 84.5 | 90.0 | 89.0 | 94.2
DETR | 90.0 | 92.0 | 90.5 | 80.0 | 82.0 | 88.0 | 86.0 | 87.5
Ours | 96.0 | 94.0 | 96.5 | 86.0 | 88.0 | 95.0 | 93.0 | 84.3
Note: values represent AP@50 (%) for each class. Writ. = writing; Read. = reading; List. = listening; TA = turning around; Raise. = raising hand; Stand. = standing; Disc. = discussing; Guid. = guiding.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
