2.3. Adaptive Spatial Pyramid Network
As shown in Figure 3, the proposed Adaptive Spatial Pyramid Network (ASPN) module is the core innovation in the Neck of the YOLOv11-ASV model. Its design goal is to enhance multi-scale feature fusion and context modeling so as to better address the challenges of diverse behaviors, significant scale variation, and fine-grained action recognition in classroom scenarios.
The design of ASPN is inspired by the typical pyramid feature fusion structure but introduces chunk-wise splitting and selective fusion mechanisms. Multi-layer features are divided into several channel sub-blocks, and within each sub-block, low-level details and high-level semantics are aligned and fused. ASPN extends this idea to a pyramid structure with multi-scale convolution branches, achieving a balanced modeling of context and details.
In terms of module structure, let the aligned multi-layer features be {F_low, F_mid, F_high}, each divided into K sub-blocks along the channel dimension:

F_m = Concat(F_m^1, …, F_m^K),  m ∈ {low, mid, high}   (1)
Within each sub-block, selective interaction is performed:

F_fused^k = φ(F_low^k, F_mid^k, F_high^k),  k = 1, …, K   (2)

where φ(·) is the selective fusion function; its specific implementation is detailed in the following sections. After all sub-blocks are fused, they are concatenated along the channel dimension and reconstructed through a 1 × 1 convolution:

F_out = Conv_{1×1}(Concat(F_fused^1, …, F_fused^K))   (3)
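The chunk-then-fuse pipeline described above can be sketched numerically. The shapes are illustrative, and the fusion function φ is stubbed out as a simple average here; the 1 × 1 reconstruction convolution is omitted for clarity.

```python
import numpy as np

# Sketch of ASPN's chunk-wise pipeline (hypothetical shapes; the fusion
# function phi and the 1x1 reconstruction conv are simplified for clarity).
K = 4                      # number of channel sub-blocks
C, H, W = 128, 16, 16      # illustrative feature dimensions

rng = np.random.default_rng(0)
F_low  = rng.standard_normal((C, H, W))
F_mid  = rng.standard_normal((C, H, W))
F_high = rng.standard_normal((C, H, W))

def phi(f_low, f_mid, f_high):
    """Placeholder selective fusion: simple average of the three levels."""
    return (f_low + f_mid + f_high) / 3.0

# Split each aligned feature map into K sub-blocks along the channel axis,
# fuse each triple of sub-blocks, then concatenate the results back.
chunks = [
    phi(fl, fm, fh)
    for fl, fm, fh in zip(
        np.split(F_low, K, axis=0),
        np.split(F_mid, K, axis=0),
        np.split(F_high, K, axis=0),
    )
]
F_out = np.concatenate(chunks, axis=0)  # a 1x1 conv would follow here
print(F_out.shape)  # (128, 16, 16): channel count is preserved
```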
2.3.1. Block Internal Functionality
Each sub-block executes the selective fusion function φ(·) from Equation (2). This function dynamically aggregates the multi-level input features F_low^k, F_mid^k, and F_high^k through a gating mechanism.
The implementation of φ(·) is as follows: first, the three input features are concatenated, and a context vector is generated by applying Global Average Pooling (GAP) to this concatenation. The vector is then fed into a small network (a fully-connected layer followed by a Softmax activation) to produce a set of adaptive weights (α_k, β_k, γ_k), which represent the importance of each feature level. The final fused output for the block is the weighted sum:

F_fused^k = α_k · F_low^k + β_k · F_mid^k + γ_k · F_high^k
This design enables each sub-block to autonomously calibrate its reliance on low-level details versus high-level semantics, which is critical for distinguishing fine-grained behavioral categories.
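A minimal numpy sketch of this gating mechanism for one sub-block follows. The fully-connected weights are random placeholders standing in for learned parameters, and the sub-block dimensions are hypothetical.

```python
import numpy as np

# Minimal sketch of the selective fusion function phi for one sub-block:
# GAP over the concatenated features -> small FC layer -> Softmax -> three
# scalar weights -> weighted sum. FC weights are random placeholders; in
# the actual model they are learned.
rng = np.random.default_rng(1)
Ck, H, W = 32, 8, 8                      # sub-block channels (hypothetical)
f_low, f_mid, f_high = (rng.standard_normal((Ck, H, W)) for _ in range(3))

concat = np.concatenate([f_low, f_mid, f_high], axis=0)   # (3*Ck, H, W)
context = concat.mean(axis=(1, 2))                        # GAP -> (3*Ck,)

W_fc = rng.standard_normal((3, 3 * Ck)) * 0.01            # FC: 3*Ck -> 3
logits = W_fc @ context
weights = np.exp(logits) / np.exp(logits).sum()           # Softmax

alpha, beta, gamma = weights
fused = alpha * f_low + beta * f_mid + gamma * f_high     # weighted sum
print(weights.sum())  # 1.0: the level weights form a convex combination
```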
Each sub-block corresponds to an independent block (block 1 to block K). Inside the block, the input sub-feature extracts contextual information under different receptive fields through multi-scale convolution branches (e.g., 3 × 3 convolution and 5 × 5 dilated convolution). The results from each branch are aggregated within the block and then compressed back to the sub-block's channel dimension via a 1 × 1 convolution.
This allows each block to independently learn the “context–detail” balance that best fits its sub-channel: blocks containing small targets tend to focus on low-level details, while blocks with complex postures or background interference rely more on high-level semantics. Finally, the outputs of all blocks are concatenated along the channel dimension to form the overall enhanced feature.
2.3.2. Multi-Scale Convolution Branches
The processing of each block can be formalized as:

B_k = Conv_{1×1}( Σ_{s∈S} Conv_s(F_fused^k) )

where Conv_s(·) represents the corresponding convolution operation of branch s (e.g., 3 × 3, dilated 5 × 5, dilated 7 × 7). This design explicitly expands the effective receptive field, enhancing robustness to targets of different scales.
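The role of dilation in these branches can be illustrated in one dimension: dilating a kernel inserts zeros between its taps, so it spans a wider input window without adding parameters. A small numpy sketch (the kernel values are arbitrary):

```python
import numpy as np

# Dilating a 1-D kernel of size k with rate d inserts d-1 zeros between
# taps, so the kernel spans k + (k-1)*(d-1) input positions.
def dilate(kernel, d):
    out = np.zeros((len(kernel) - 1) * d + 1)
    out[::d] = kernel
    return out

k3 = np.array([1.0, 1.0, 1.0])
k3_d2 = dilate(k3, 2)          # [1, 0, 1, 0, 1] -> spans 5 positions
x = np.arange(10, dtype=float)
y = np.convolve(x, k3_d2, mode="valid")
print(len(k3_d2), y[0])  # 5 6.0  (y[0] = x[0] + x[2] + x[4])
```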
2.3.3. Fusion and Residual Enhancement
The outputs of all blocks are concatenated and then integrated through a 1 × 1 convolution, followed by addition of the aligned residual features X. The final output is obtained through batch normalization (BN) and SiLU activation:

Y = σ( BN( Conv_{1×1}( Concat(B_1, …, B_K) ) + X ) )

where σ represents the SiLU activation function. The above process forms a closed loop of chunking → independent modeling → concatenation and fusion → residual enhancement.
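This fusion-and-residual step can be sketched numerically. Shapes and the stand-in weight matrix are illustrative, and BN is simplified here to per-channel standardization without learned affine parameters.

```python
import numpy as np

# Sketch of the final fusion step: concatenate block outputs, mix channels
# with a 1x1 conv (a random matrix stands in for the learned weights), add
# the residual X, then simplified BatchNorm and SiLU.
rng = np.random.default_rng(3)
K, Ck, H, W = 4, 8, 4, 4
blocks = [rng.standard_normal((Ck, H, W)) for _ in range(K)]
X = rng.standard_normal((K * Ck, H, W))          # aligned residual features

concat = np.concatenate(blocks, axis=0)          # (K*Ck, H, W)
W_1x1 = rng.standard_normal((K * Ck, K * Ck)) * 0.1
mixed = np.einsum("oc,chw->ohw", W_1x1, concat)  # 1x1 conv as channel mix

pre = mixed + X                                  # residual addition
mu = pre.mean(axis=(1, 2), keepdims=True)
sd = pre.std(axis=(1, 2), keepdims=True) + 1e-5
bn = (pre - mu) / sd                             # simplified BatchNorm
out = bn * (1.0 / (1.0 + np.exp(-bn)))           # SiLU: x * sigmoid(x)
print(out.shape)  # (32, 4, 4)
```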
2.3.4. Module Configuration
The ASPN module was configured with four channel blocks (K = 4). Within each block, multi-scale contextual information was extracted using parallel convolutional branches with kernel sizes of 3, 5, and 7. The 5 × 5 and 7 × 7 convolutions employed dilation rates of 2 and 3 respectively to expand the receptive field. Input feature maps to the ASPN at the P3, P4, and P5 levels had channel dimensions of 128, 256, and 512 correspondingly. The selective fusion function φ(·) generated adaptive weights through a compact network comprising global average pooling, a fully-connected layer with 16 hidden units, and a Softmax output layer.
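The effective receptive field implied by this configuration follows from the standard formula k_eff = k + (k − 1)(d − 1); the short check below applies it to the kernel/dilation pairs stated above (the derived numbers come from the formula, not from the text).

```python
# Effective kernel size of a dilated convolution: k_eff = k + (k-1)*(d-1).
# Checking the configuration above: 3x3 undilated, 5x5 with dilation 2,
# 7x7 with dilation 3.
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)

for k, d in [(3, 1), (5, 2), (7, 3)]:
    print(k, d, effective_kernel(k, d))
# 3x3 d=1 -> 3, 5x5 d=2 -> 9, 7x7 d=3 -> 19
```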
2.3.5. Method Advantages
The advantages of ASPN are as follows. (1) Fine-grained modeling: independent processing at the block level allows each channel sub-block to flexibly select high-level semantics or low-level details. (2) Multi-scale enhancement: parallel convolution branches expand the receptive field, improving robustness to small targets and occluded scenes. (3) Adaptive aggregation: sub-blocks can learn optimal fusion strategies, improving the ability to distinguish complex actions. (4) Lightweight and scalable: the number of blocks K and the scale of the branches are adjustable, facilitating a trade-off between accuracy and efficiency.
2.4. Collaboration Between VanillaNet and ASPN
This paper combines VanillaNet-5 with the ASPN module to improve performance in classroom behavior recognition. This choice is strategically designed for the classroom environment. VanillaNet’s minimalist, sequential architecture, devoid of complex branches, enhances the clarity of feature maps and ensures more direct gradient flow. This design is particularly beneficial for addressing motion blur and accurately localizing actions under partial occlusion, common challenges in classroom videos. Additionally, its structural re-parameterization strikes an optimal balance, offering rich non-linearity during training for robust feature learning, while seamlessly merging into a highly efficient network for real-time inference, an essential requirement for smart classroom applications.
The VanillaNet-5 backbone was employed with a width multiplier of 1.0, resulting in an initial channel size of 32. This configuration provides a foundational feature hierarchy where the channel dimensions progressively double across its five stages, effectively balancing representational capacity with computational efficiency for the task.
The core concept of the VanillaNet deep training strategy is to modify the network structure in the early stages of training. Unlike traditional methods, in which only a single convolution layer is trained, VanillaNet optimizes two convolution layers and the activation function between them at the start of training. As training progresses, the activation function gradually degenerates into an identity mapping through a weighting mechanism. After training, owing to the identity property of the activation function, the two convolution layers can be merged into one via structural re-parameterization, improving the model's inference speed without sacrificing performance.
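The merging step admits a compact numerical check: once the activation between two convolutions has become the identity, the pair collapses into a single convolution. For 1 × 1 convolutions (shown below with toy sizes; the weights are random illustrations) this is just a matrix product of the two weight matrices.

```python
import numpy as np

# Post-training merge sketch: with an identity "activation" between them,
# conv2(conv1(x)) equals a single conv with weights W2 @ W1. 1x1 convs act
# per spatial position, so one position suffices for the demonstration.
rng = np.random.default_rng(2)
W1 = rng.standard_normal((8, 4))    # conv1: 4 -> 8 channels
W2 = rng.standard_normal((4, 8))    # conv2: 8 -> 4 channels
x = rng.standard_normal(4)          # one spatial position, 4 channels

two_layer = W2 @ (W1 @ x)           # conv -> identity activation -> conv
merged = (W2 @ W1) @ x              # single merged convolution
print(np.allclose(two_layer, merged))  # True
```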
Specifically, for any activation function A(x), it is linearly combined with an identity mapping, defined as follows:

A′(x) = (1 − λ)A(x) + λx

where λ is a hyperparameter used to balance the degree of non-linearity. Let e and E represent the current training epoch and the total number of training epochs, respectively; then λ = e/E. Therefore, in the early stages of training (e ≈ 0), λ ≈ 0, and the modified activation function A′(x) approximates the original function A(x), exhibiting strong non-linearity. As training nears completion (e ≈ E), λ ≈ 1, and A′(x) approaches the identity mapping x, effectively removing the non-linearity between the two convolution layers and creating the conditions for layer merging.
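The endpoint behavior of this schedule can be verified numerically. ReLU is used below as the base activation purely for illustration; the text does not fix a particular A.

```python
import numpy as np

# Numerical check of the deep-training schedule A'(x) = (1-lam)A(x) + lam*x
# with lam = e/E. ReLU is an illustrative choice for the base activation A.
def A(x):
    return np.maximum(x, 0.0)       # ReLU

def A_prime(x, e, E):
    lam = e / E
    return (1 - lam) * A(x) + lam * x

x = np.array([-2.0, -0.5, 1.0, 3.0])
E = 100
start = A_prime(x, 0, E)      # epoch 0: equals A(x), fully non-linear
end = A_prime(x, E, E)        # final epoch: equals x, pure identity
print(np.allclose(start, A(x)), np.allclose(end, x))  # True True
```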
Since simple network structures are often limited by insufficient non-linear expressiveness, VanillaNet employs a parallel function stacking strategy to enhance the non-linearity of a single activation layer, thereby significantly improving the non-linear expression ability of each activation function. This approach constructs a more powerful composite activation function by weighted summing of multiple base activation functions, formally defined as:

A_s(x) = Σ_{i=1}^{n} a_i A(x + b_i)

where n represents the number of stacked functions, and a_i and b_i are the scale weight and bias of the i-th activation function, respectively. This structure significantly enhances the non-linear fitting capability of the activation function.
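A quick sanity check of this formula: with n = 1, a_1 = 1, and b_1 = 0 the series reduces to the base activation. The sketch below uses ReLU as the base and arbitrary illustrative weights for the stacked case.

```python
import numpy as np

# Sketch of the series activation A_s(x) = sum_i a_i * A(x + b_i).
def A(x):
    return np.maximum(x, 0.0)       # ReLU as the base activation

def A_series(x, a, b):
    return sum(ai * A(x + bi) for ai, bi in zip(a, b))

x = np.linspace(-2, 2, 9)

# n = 1, a = 1, b = 0 recovers the base activation exactly.
reduced = A_series(x, [1.0], [0.0])
print(np.allclose(reduced, A(x)))  # True

# n = 3 stacked copies with shifted biases yield a richer response curve.
y = A_series(x, a=[0.5, 1.0, 0.5], b=[-1.0, 0.0, 1.0])
print(y.shape)  # (9,)
```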
To further enhance its approximation ability, VanillaNet draws inspiration from BNET, enabling the series activation function to utilize local neighborhood information of the input features to learn global context. Specifically, for a position (h, w, c) on the input feature map, the improved activation function is defined as:

A_s(x_{h,w,c}) = Σ_{i,j ∈ {−n,…,n}} a_{i,j,c} A(x_{i+h, j+w, c} + b_c)
In experiments, VanillaNet demonstrated strong detection performance for actions with large movements and clear boundaries, such as standing or raising a hand; for these actions, it effectively extracts key features and achieves accurate detection. However, for easily confused similar behaviors, such as reading, bowing the head, and writing, its simplified structure limits feature representation, reducing its ability to distinguish subtle differences and degrading performance.
To address this issue, we combine VanillaNet-5 with the ASPN module. ASPN's multi-scale contextual enhancement and chunk-wise feature fusion compensate for VanillaNet's limitations in fine-grained behavior modeling: by dynamically weighting features at different scales, it improves recognition of similar actions, such as bowing the head and writing, in complex classroom scenarios. This combination retains VanillaNet's efficiency while enriching feature representation for high accuracy. Concretely, VanillaNet-5 performs feature downsampling and channel expansion at multiple stages using a "1 × 1 convolution + MaxPool" structure, generating the multi-scale feature maps consumed by ASPN; through adaptive weighting, ASPN integrates low-level details and high-level semantic features, further enhancing the network's ability to perceive and distinguish fine-grained classroom behaviors. This design improves YOLOv11's detection accuracy for fine-grained actions and its robustness in complex classroom environments, further advancing its potential in classroom behavior recognition tasks.