YOLOv8 [29], released by Ultralytics in 2023, offers five versions of different scales—N, S, M, L, and X—distinguished by network depth, width, and maximum channel count, enabling adaptation to various deployment platforms and scenarios. Among these, YOLOv8n has the fewest parameters and the lowest floating-point operation count, delivering high efficiency ideal for real-time deployment while maintaining strong accuracy. Consequently, this paper adopts YOLOv8n as the baseline model, whose network architecture and key components are illustrated in Figure 1.
YOLOv8 replaces the C3 module with the C2f module, which combines the C3 design with the ELAN concept. The C2f module splits the input features, processes one branch through a chain of bottleneck blocks, and fuses all intermediate outputs, enhancing gradient information flow and feature extraction capability while maintaining high computational efficiency. The Neck adopts the PAN-FPN architecture to integrate multi-scale features and introduces the SPPF module, which expands the receptive field through fast serial pooling. The Head employs a Decoupled-Head architecture to separate the classification and regression tasks. For the loss function, the regression branch combines Distribution Focal Loss (DFL) with CIoU loss, jointly improving target localization accuracy.
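To make the C2f idea concrete, the following PyTorch sketch (a simplified, hypothetical re-implementation rather than the exact Ultralytics code) shows the split–process–concatenate pattern described above.

```python
# Minimal sketch of a C2f-style block: split after a 1x1 conv, run a chain of
# bottlenecks on one branch, concatenate all intermediate outputs, fuse with a
# final 1x1 conv. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2                      # hidden channels per branch
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, 2 * self.c, 1, bias=False),
                                 nn.BatchNorm2d(2 * self.c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d((2 + n) * self.c, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))    # split into two branches
        for m in self.m:
            y.append(m(y[-1]))                   # each bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))     # fuse all intermediate features

# Example: shape is preserved when c_in == c_out.
print(C2f(64, 64, n=2)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```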
Despite its outstanding performance, YOLOv8 still faces numerous challenges in pedestrian detection within complex scenarios. These include heightened demands for lightweight and real-time capabilities in resource-constrained environments, as well as the need to improve detection accuracy and stability in crowded areas and scenes with complex lighting conditions. Therefore, this paper proposes the YOLO-SGCF algorithm as an enhancement to the YOLOv8 model, with the following specific optimizations:
2.1.1. Swin Transformer
Swin Transformer [30] is a Transformer-based visual backbone network proposed in 2021. Compared with traditional convolutional neural networks (CNNs), Swin Transformer models global dependencies more flexibly and demonstrates outstanding performance across multiple visual tasks.
In real-world scenarios, pedestrians vary in distance from the camera, leading to significant scale differences in their representations within images. Conventional CNNs may lose fine-grained details of small-scale pedestrians or fail to fully capture features of large-scale pedestrians when processing objects at different scales. The Swin Transformer's multi-scale feature fusion capability effectively addresses this scale variation issue, enhancing the detection of abnormal behaviors across pedestrians of varying sizes. The structural diagram is shown in Figure 3.
The entire model adopts a hierarchical design comprising four stages. The Patch Partition layer handles preprocessing of the input image, segmenting the original image into non-overlapping patches. These image patches are then flattened into one-dimensional vectors and mapped to a high-dimensional feature space via a linear embedding operation, while position embeddings are incorporated. The Patch Merging layer implements downsampling to construct the hierarchical feature representation of the Swin Transformer: it concatenates the features of each group of 2 × 2 adjacent patches and applies a linear layer to the resulting high-dimensional features. Subsequently, a Swin Transformer Block performs feature transformation, maintaining a resolution of $\frac{H}{8} \times \frac{W}{8}$. The patch merging and feature transformation in the first block constitute "Stage 2". This process is repeated twice more, forming "Stage 3" and "Stage 4".
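As a concrete illustration of the downsampling step, the following PyTorch sketch (assumed shapes; a simplified version rather than the official implementation) shows how 2 × 2 neighboring patches are concatenated and linearly projected.

```python
# Swin-style patch merging: 2x2 neighboring patches are concatenated along the
# channel dimension (C -> 4C) and projected back to 2C, halving the resolution.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C) with H and W assumed even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

# Example: a 56x56 feature map with 96 channels becomes 28x28 with 192 channels.
print(PatchMerging(96)(torch.randn(1, 56, 56, 96)).shape)  # torch.Size([1, 28, 28, 192])
```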
The Swin Transformer Block typically appears as a two-stage serial structure. In the first stage, Window-based Multi-head Self-Attention (W-MSA) divides the feature map into multiple non-overlapping windows, and self-attention is computed only within each window, significantly reducing computational complexity. Assuming each window has height and width M, a feature map of height h and width w is divided into a total of $\frac{h}{M} \times \frac{w}{M}$ windows, and a multi-head attention module is then applied within each window. For a feature map with height h, width w, and C channels, the computational complexity of standard global multi-head self-attention (MSA) is as follows:
$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C$$
Since attention is restricted to the $\frac{h}{M} \times \frac{w}{M}$ windows of size $M \times M$, the calculation formula for W-MSA is as follows:
$$\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC$$
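For intuition, the following short Python sketch evaluates the two complexity expressions above for an assumed 56 × 56 feature map with C = 96 channels and window size M = 7.

```python
# Quick numeric check of the two complexity formulas (illustrative values).
h, w, C, M = 56, 56, 96, 7

msa_flops = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C    # global MSA
wmsa_flops = 4 * h * w * C**2 + 2 * M**2 * h * w * C   # window-based MSA

print(f"MSA:   {msa_flops / 1e9:.2f}e9 operations")
print(f"W-MSA: {wmsa_flops / 1e9:.2f}e9 operations")
# The quadratic (hw)^2 term dominates MSA, while W-MSA grows only linearly in hw.
```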
The second stage employs Shifted Window Multi-Head Self-Attention (SW-MSA). By shifting the window boundaries, it overcomes the limitations of the W-MSA module, enabling cross-window information exchange to effectively capture global feature dependencies. Each attention module is followed by a standard structure comprising an MLP, a residual connection, and layer normalization. The MLP further extracts high-level features through nonlinear transformations; the residual connection directly adds the module’s input and output, effectively mitigating gradient vanishing and aiding deep network training; layer normalization stabilizes the training process and accelerates convergence by normalizing the feature distribution.
Overall, the Swin Transformer balances modeling capabilities with the efficiency demands of visual tasks through Window Attention and Shifted Window Attention. It preserves global context modeling capabilities while significantly enhancing computational efficiency for visual tasks.
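The following PyTorch sketch (simplified; it omits the attention mask that the full implementation applies to shifted windows) illustrates the window partitioning used by W-MSA and the cyclic shift that precedes SW-MSA.

```python
# Window partitioning and the cyclic shift used before SW-MSA (assumed shapes).
import torch

def window_partition(x, M):
    # x: (B, H, W, C) -> (num_windows * B, M, M, C)
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

B, H, W, C, M = 1, 56, 56, 96, 7
x = torch.randn(B, H, W, C)

# W-MSA: attention is computed independently inside each of the (H/M)*(W/M) windows.
windows = window_partition(x, M)
print(windows.shape)          # torch.Size([64, 7, 7, 96])

# SW-MSA: cyclically shift the feature map by M//2 so that the new windows
# straddle the old window boundaries, enabling cross-window information exchange.
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, M)
print(shifted_windows.shape)  # torch.Size([64, 7, 7, 96])
```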
2.1.3. Attention
CA (Coordinate Attention) [32] is an attention mechanism designed for CNNs. While traditional attention mechanisms typically compute attention weights from relative or absolute position information, CA introduces coordinate position information as the basis for attention computation. It primarily addresses how to better capture spatial information and inter-channel dependencies within the network.
The core concept of CA is to embed positional information into channel attention: features are aggregated along the two spatial directions separately, and the resulting direction-aware descriptors are used to compute attention weights. These weights re-weight different positions within the input feature map, yielding feature representations with stronger locality. The specific process is illustrated in Figure 5.
The input feature map is $X \in \mathbb{R}^{C \times H \times W}$. It first passes through a residual module, which preserves the fundamental information of the original features while providing input for subsequent attention calculations. Horizontal (X AvgPool) and vertical (Y AvgPool) average pooling are then performed separately. Taking X AvgPool as an example, global average pooling is applied along the horizontal dimension (width W) of the input feature map, yielding the horizontal feature vector $z^{h} \in \mathbb{R}^{C \times H \times 1}$. The formula is:
$$z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c}(h, i)$$
where $0 \le c < C$, $0 \le h < H$.
The horizontally pooled $z^{h}$ and the vertically pooled $z^{w} \in \mathbb{R}^{C \times 1 \times W}$ are concatenated along the spatial dimension (Concat), and a 1 × 1 convolution (Conv2d) is applied to the concatenated feature to fuse the horizontal and vertical information. Batch normalization (Batch Norm) is then applied to the convolved feature to accelerate training and enhance stability, followed by a non-linear activation function such as ReLU, which yields the activated feature $f \in \mathbb{R}^{C/r \times (H+W) \times 1}$ (where $r$ is the reduction ratio, used to reduce computational load). $f$ is then split into a horizontal feature $f^{h} \in \mathbb{R}^{C/r \times H \times 1}$ and a vertical feature $f^{w} \in \mathbb{R}^{C/r \times W \times 1}$; a 1 × 1 convolution is applied to each, and a Sigmoid activation function produces the horizontal attention weight $g^{h}$ and the vertical attention weight $g^{w}$. The two weights are then applied to the original input feature by element-wise multiplication along their respective directions to obtain the output feature $Y$. The formulas are as follows:
$$g^{h} = \sigma\big(F_{h}(f^{h})\big), \quad g^{w} = \sigma\big(F_{w}(f^{w})\big)$$
$$y_{c}(i, j) = x_{c}(i, j) \times g_{c}^{h}(i) \times g_{c}^{w}(j)$$
where $F_{h}$ and $F_{w}$ denote the 1 × 1 convolutions and $\sigma$ denotes the Sigmoid function.
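The complete computation can be summarized by the following PyTorch sketch of a Coordinate Attention block (a simplified re-implementation with assumed layer sizes, not the original authors' code).

```python
# Coordinate Attention: directional average pooling, joint 1x1 conv encoding,
# then direction-wise attention weights that re-weight the input.
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)            # C/r hidden channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # X AvgPool -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Y AvgPool -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                           # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)               # concat along spatial dim
        y = self.act(self.bn1(self.conv1(y)))          # (B, C/r, H+W, 1)
        f_h, f_w = torch.split(y, [h, w], dim=2)       # split back into two directions
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * g_h * g_w                           # direction-aware re-weighting

# Example: the output keeps the input shape.
print(CoordAtt(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```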
Unlike previous approaches that convert 2D global pooling into a single feature vector, CA decomposes channel attention into 1D feature encoding with bidirectional aggregation. This approach captures long-range dependencies and precise positional information along two distinct directions, generating complementary feature maps to enhance target representations. This method avoids the loss of positional information inherent in 2D global pooling, thereby more effectively improving network performance.
In response to this characteristic, this paper embeds the CA into the backbone of YOLOv8. This allows the network to focus more on target-related regions and channel features within images, enabling it to capture key features of abnormal behavior more accurately while suppressing interference from irrelevant background information. Simultaneously, it enables the subsequent Neck and Head components to perform object bounding box regression and category prediction based on more precise features. This significantly improves the accuracy of target location detection, particularly for multi-object detection in complex scenes.
2. CBAM
CBAM (Convolutional Block Attention Module) [33] is a representative lightweight attention mechanism. Its core idea is to integrate two branches: channel attention and spatial attention. The module's significant advantage is that it effectively enhances a convolutional neural network's ability to capture and exploit key features, improving feature extraction performance with only a minimal increase in computational complexity and parameter count. The specific structure and workflow are illustrated in Figure 6.
The primary objective of the channel attention module (CAM) is to explicitly model dependencies between channels and generate a channel attention map. First, the input feature map $F \in \mathbb{R}^{C \times H \times W}$ undergoes both Global Average Pooling (GAP) and Global Max Pooling (GMP), reducing the spatial dimensions to 1 × 1 and generating two distinct channel descriptors $F_{avg}$ and $F_{max}$. The calculation formulas are as follows:
$$F_{avg}^{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{c}(i, j), \quad F_{max}^{c} = \max_{1 \le i \le H,\; 1 \le j \le W} F_{c}(i, j)$$
where $F_{c}(i, j)$ denotes the value at coordinates $(i, j)$ in the $c$-th channel of the feature map $F$.
Then, the two channel descriptors are fed into a shared multilayer perceptron (MLP) for feature transformation. The MLP consists of a hidden layer and an output layer, with the hidden layer containing $C/r$ neurons, where $r$ is the dimension reduction factor. Let the weight matrices of the MLP be $W_{0} \in \mathbb{R}^{C/r \times C}$ and $W_{1} \in \mathbb{R}^{C \times C/r}$, with biases $b_{0}$ and $b_{1}$. The result for the global average pooling branch is as follows:
$$F_{avg}' = W_{1}\,\delta(W_{0}F_{avg} + b_{0}) + b_{1}$$
and for the global max pooling branch:
$$F_{max}' = W_{1}\,\delta(W_{0}F_{max} + b_{0}) + b_{1}$$
where $\delta$ represents the ReLU activation function. The two outputs are added and passed through the Sigmoid activation function $\sigma$ to obtain the channel attention weight:
$$M_{c}(F) = \sigma\big(F_{avg}' + F_{max}'\big)$$
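The channel attention branch can be sketched in PyTorch as follows (a simplified re-implementation; the shared MLP is realized with 1 × 1 convolutions, and the reduction ratio is an assumed value).

```python
# CBAM channel attention: shared MLP applied to GAP and GMP descriptors,
# summed and passed through a sigmoid.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP: C -> C/r -> C, implemented with 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=True),  # W0, b0
            nn.ReLU(inplace=True),                                     # delta
            nn.Conv2d(channels // reduction, channels, 1, bias=True),  # W1, b1
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # GAP descriptor
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # GMP descriptor
        return torch.sigmoid(avg + mx)                            # M_c(F): (B, C, 1, 1)
```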
The primary objective of the spatial attention module (SAM) is to explicitly model spatial dependencies between locations and generate a spatial attention map. First, the input feature map undergoes max pooling and average pooling along the channel dimension, yielding two feature maps $F_{avg}^{s},\, F_{max}^{s} \in \mathbb{R}^{1 \times H \times W}$. These two feature maps are concatenated along the channel dimension to form a $2 \times H \times W$ feature map, which is then processed by a 7 × 7 convolutional layer (with one kernel, stride 1, and padding 3) to learn spatial correlations. The values of the resulting map are normalized to the range [0, 1] by a Sigmoid activation function, yielding the spatial attention weights $M_{s}(F)$. The calculation formula is:
$$M_{s}(F) = \sigma\Big(f^{7 \times 7}\big([F_{avg}^{s}; F_{max}^{s}]\big)\Big)$$
where $f^{7 \times 7}$ denotes a 7 × 7 convolution operation, while $[\,;\,]$ represents a concatenation operation along the channel dimension.
Ultimately, the channel attention weights and spatial attention weights are multiplied element-wise with $F$ in sequence, yielding the attention-refined feature map $F''$:
$$F' = M_{c}(F) \otimes F, \quad F'' = M_{s}(F') \otimes F'$$
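A corresponding sketch of the spatial attention branch and the sequential combination of the two branches is shown below (simplified, with assumed shapes; it reuses the ChannelAttention class from the previous sketch).

```python
# CBAM spatial attention and the full two-stage module.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # (B, 1, H, W) average over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)  # (B, 1, H, W) max over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s: (B, 1, H, W)

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)  # defined in the previous sketch
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # F'  = M_c(F) ⊗ F
        return x * self.sa(x)   # F'' = M_s(F') ⊗ F'
```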
The combination of channel and spatial attention mechanisms has proven effective in complex traffic detection tasks. Tan et al. [34] integrated both attention mechanisms into a deep differential segmentation network for rail transit foreign object detection, successfully mitigating interference from airflow disturbances and lighting variations, thereby validating the complementary advantages of dual-dimensional attention. In pedestrian anomaly detection scenarios, abnormal movements often involve distinctive postural changes across body parts and interactions with the surrounding environment, necessitating precise capture of key information from multi-scale features. CBAM employs channel attention to enable the network to focus on feature channels crucial to target behaviors—such as those related to human contours, dynamic postural changes, and anomalous actions. Spatial attention highlights the spatial region containing the target within the image, reducing background noise interference for more accurate target recognition and localization. Furthermore, during multi-scale feature fusion, CBAM performs separate attention calculations and adjustments for feature maps at different scales. This enables more rational weight allocation during fusion, enhancing valuable features at various scales that capture details of pedestrian abnormal actions. Consequently, the effectiveness of multi-scale feature fusion is significantly improved.
2.1.4. Focal-EIoU Loss
The Intersection over Union (IoU) loss function measures the degree of overlap between predicted and ground-truth bounding boxes, thereby evaluating the accuracy of predicted boxes. YOLOv8 employs the CIoU loss function. For low-quality images, unclear edge information of objects may cause significant positional errors between predicted and ground-truth boxes, leading to poor detection performance. EIoU improves upon CIoU by separating the aspect ratio into width and height components and calculating the difference for each dimension. The width-height loss directly minimizes the difference between the width and height of the predicted box and the ground-truth box, accelerating convergence. However, it lacks a dedicated mechanism for handling sample imbalance, leading to simple samples dominating training while difficult samples receive insufficient optimization. Focal Loss [35] addresses this by reducing the loss weight of easy samples, enabling the model to focus on hard samples and enhance overall detection robustness. Therefore, when selecting loss functions, we introduce a dynamic weighting mechanism building upon EIoU that assigns higher weights to samples with low IoU. This optimizes the sample imbalance issue in bounding box regression, compelling the model to focus on challenging samples.
Given two boxes M and N with areas $S_{M}$ and $S_{N}$, respectively, their IoU is defined as:
$$\mathrm{IoU} = \frac{S_{M \cap N}}{S_{M \cup N}}$$
where $M \cap N$ denotes the intersection region between region M and region N, whose area is $S_{M \cap N}$, and $M \cup N$ denotes the union region between region M and region N. Then we have:
$$\mathrm{IoU} = \frac{S_{M \cap N}}{S_{M} + S_{N} - S_{M \cap N}}$$
A higher IoU value indicates greater prediction accuracy. Conversely, a lower IoU indicates poorer model performance.
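A small worked example of the IoU computation for two axis-aligned boxes (with illustrative coordinates) is given below.

```python
# IoU for two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(m, n):
    # Intersection rectangle
    ix1, iy1 = max(m[0], n[0]), max(m[1], n[1])
    ix2, iy2 = min(m[2], n[2]), min(m[3], n[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # S_{M∩N}
    s_m = (m[2] - m[0]) * (m[3] - m[1])                 # S_M
    s_n = (n[2] - n[0]) * (n[3] - n[1])                 # S_N
    return inter / (s_m + s_n - inter)                  # S_{M∩N} / S_{M∪N}

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4 / (16 + 16 - 4) ≈ 0.143
```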
Compared with the CIoU penalty term, the core optimization of the EIoU penalty lies in separating the aspect-ratio influence factor and computing the width and height differences between the target box and the predicted box independently. The loss function comprises three components: the overlap loss $L_{IoU}$, the center distance loss $L_{dis}$, and the width-height loss $L_{asp}$. The first two components align with CIoU, while $L_{asp}$ accelerates convergence by directly reducing the gap in width and height between the target bounding box and the predicted bounding box. The specific formula is as follows:
$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}}$$
where $b$ and $b^{gt}$ denote the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ denotes the Euclidean distance, $c$ is the diagonal length of the minimum bounding box covering both boxes, and $C_{w}$ and $C_{h}$ represent the width and height of that minimum bounding box.
By integrating the EIoU loss with the Focal L1 loss, the Focal-EIoU loss formula is derived:
$$L_{Focal\text{-}EIoU} = \mathrm{IoU}^{\gamma}\, L_{EIoU}$$
where $\gamma$ is a hyperparameter that controls the degree of outlier suppression, regulating the curvature of the curve.
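The following PyTorch sketch combines the EIoU terms and the focal weighting described above (a simplified re-implementation for boxes in (x1, y1, x2, y2) format; the value of γ is an illustrative assumption).

```python
# Focal-EIoU loss: L = IoU^gamma * (1 - IoU + center-distance term + width-height term).
import torch

def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    # Intersection and IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Minimum enclosing box (width C_w, height C_h, squared diagonal c^2)
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1

    # Center distance term rho^2(b, b_gt) / c^2
    cpx = (pred[:, 0] + pred[:, 2]) / 2; cpy = (pred[:, 1] + pred[:, 3]) / 2
    ctx = (target[:, 0] + target[:, 2]) / 2; cty = (target[:, 1] + target[:, 3]) / 2
    dist = ((cpx - ctx) ** 2 + (cpy - cty) ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Width-height term rho^2(w, w_gt)/C_w^2 + rho^2(h, h_gt)/C_h^2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    asp = (wp - wt) ** 2 / (cw ** 2 + eps) + (hp - ht) ** 2 / (ch ** 2 + eps)

    eiou = 1 - iou + dist + asp
    return (iou.detach().clamp(min=eps) ** gamma) * eiou  # focal weighting by IoU^gamma
```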