3.1.1. DAC2f Structure
This paper designs a deep fusion of the deformable attention mechanism (DAttention, DAT module) and the C2f module to construct the DAC2f (DAttention-C2f) structure. Through a closed-loop design of precise extraction, efficient fusion, and high-quality representation, it adaptively captures the core features of occluded passengers, suppresses background interference such as seat armrests, and breaks through the fixed global sampling paradigm of standard attention.
(1) DAT Module
To address the enclosed space, high passenger density, and frequent occlusions of bus scenarios, DAttention is specifically designed to tackle occlusion: it generates unified grid reference points and dynamic offset vectors via a lightweight sub-network to focus on key target regions; it uses bilinear interpolation for feature sampling at the deformed points to capture the incomplete features of occluded passengers; and it incorporates a deformable relative position bias to enhance spatial dependency modeling, facilitating differentiation between occluded targets and background interference.
The core logic of the DAT Module lies in the accurate capture of key features through dynamic sampling, with the following specific process. First, a unified grid of reference points $p \in \mathbb{R}^{H_G \times W_G \times 2}$ is generated for the input feature map $x \in \mathbb{R}^{H \times W \times C}$. The grid size is determined by the downsampling factor $r$: $H_G = H / r$, $W_G = W / r$. The coordinates of the reference points are normalized to the range [0, 1], where (0, 0) denotes the top-left corner and (1, 1) denotes the bottom-right corner.
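As a minimal illustration of this step, the PyTorch sketch below generates the normalized reference-point grid; the function name and the cell-center placement of the points are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def uniform_reference_points(H, W, r):
    """Uniform grid of reference points for an H x W feature map, downsampled
    by factor r and normalized to [0, 1] ((0, 0) = top-left corner)."""
    Hg, Wg = H // r, W // r
    # Cell-center coordinates, normalized to the [0, 1] range.
    ys = (torch.arange(Hg, dtype=torch.float32) + 0.5) / Hg
    xs = (torch.arange(Wg, dtype=torch.float32) + 0.5) / Wg
    ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack((ref_x, ref_y), dim=-1)   # shape: (Hg, Wg, 2)

p = uniform_reference_points(64, 64, r=4)        # -> torch.Size([16, 16, 2])
```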
Subsequently, the feature map is linearly projected to obtain dimension-consistent query vectors $q = x W_q$ for the attention calculation. The query tokens are input into the lightweight sub-network $\theta_{\text{offset}}(\cdot)$ to generate the offset vectors $\Delta p = \theta_{\text{offset}}(q)$. A predefined factor $s$ is introduced to constrain the amplitude of the offset vectors, $\Delta p \leftarrow s \tanh(\Delta p)$, ensuring training stability. Feature sampling is performed at the deformed point positions $p + \Delta p$ to construct keys $\tilde{k} = \tilde{x} W_k$ and values $\tilde{v} = \tilde{x} W_v$, where $\tilde{x} = \phi(x; p + \Delta p)$. Bilinear interpolation is adopted during sampling to guarantee differentiability, and the sampling function is defined in Equation (1):

$$\phi\big(z; (p_x, p_y)\big) = \sum_{(r_x, r_y)} g(p_x, r_x)\, g(p_y, r_y)\, z[r_y, r_x, :] \tag{1}$$

Among them, $g(a, b) = \max(0, 1 - |a - b|)$, and $(r_x, r_y)$ indexes the integral positions of $z \in \mathbb{R}^{H \times W \times C}$; only the 4 integral points nearest the target point are non-zero, enabling efficient local feature extraction. Finally, the output features are obtained through the attention computation in Equation (2):

$$z^{(m)} = \sigma\!\left(\frac{q^{(m)} \tilde{k}^{(m)\top}}{\sqrt{d}} + \phi(\hat{B}; R)\right)\tilde{v}^{(m)}, \quad m = 1, \dots, M \tag{2}$$

where $\sigma(\cdot)$ denotes the softmax function, $d$ is the per-head dimension, $m$ indexes the $M$ attention heads, and $\phi(\hat{B}; R)$ is the deformable relative position bias introduced below; the head outputs are concatenated and projected to form the final output.
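To make the overall data flow concrete, the following is a simplified single-head PyTorch sketch of this computation. The relative position bias is omitted, `offset_net` is assumed to map the query map down to the $H_G \times W_G$ grid, `Wq`/`Wk`/`Wv` are assumed to be 1 × 1 convolution projections, and the $1/\sqrt{C}$ scale stands in for the per-head $1/\sqrt{d}$; `F.grid_sample` serves as the differentiable bilinear sampler of Equation (1).

```python
import torch
import torch.nn.functional as F

def deformable_attention(x, ref, offset_net, Wq, Wk, Wv, s=2.0):
    """Single-head sketch of the DAT computation (position bias omitted).
    x: (B, C, H, W); ref: (Hg, Wg, 2) reference points in [0, 1];
    offset_net: assumed to map (B, C, H, W) -> (B, 2, Hg, Wg)."""
    B, C, H, W = x.shape
    q = Wq(x)                                     # queries from the full map
    offsets = s * offset_net(q).tanh()            # amplitude bounded by s
    # Deformed points; grid_sample expects coordinates in [-1, 1].
    pos = ref.unsqueeze(0) + offsets.permute(0, 2, 3, 1)
    x_tilde = F.grid_sample(x, 2.0 * pos - 1.0,   # Eq. (1): bilinear sampling
                            mode="bilinear", align_corners=False)
    k = Wk(x_tilde).flatten(2)                    # keys at deformed points
    v = Wv(x_tilde).flatten(2)                    # values at deformed points
    attn = torch.softmax((q.flatten(2).transpose(1, 2) @ k) / C ** 0.5, dim=-1)
    return attn @ v.transpose(1, 2)               # Eq. (2), bias term omitted
```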
(2) Offset Generation
The core design of the offset generation sub-network is to enhance local feature perception for learning reasonable offsets, and its architecture is shown in Figure 1. The input features first pass through a 5 × 5 depthwise convolution to capture local spatial information. After non-linearity is introduced via the GELU activation function, a 1 × 1 convolution outputs the 2D offset vectors. To avoid a forced global offset, the bias term of the 1 × 1 convolution is removed, ensuring the adaptability and stability of offset generation while matching the characteristic that reference points cover a local $s \times s$ region.
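A minimal PyTorch rendering of this sub-network is sketched below (depthwise 5 × 5 convolution, GELU, bias-free 1 × 1 convolution, and a tanh bound scaled by the offset range factor $s$); the class name and the assumption that the convolution stride equals the downsampling factor $r$ are illustrative.

```python
import torch.nn as nn

class OffsetNet(nn.Module):
    """Offset generation sub-network: local perception via depthwise conv,
    GELU non-linearity, then a bias-free 1x1 conv emitting 2-D offsets."""
    def __init__(self, channels, stride, offset_range_factor):
        super().__init__()
        self.s = offset_range_factor
        self.net = nn.Sequential(
            # 5x5 depthwise conv captures local spatial context; the stride
            # (assumed equal to r) maps H x W down to Hg x Wg.
            nn.Conv2d(channels, channels, 5, stride=stride,
                      padding=2, groups=channels),
            nn.GELU(),
            # Bias removed so that no constant global offset is forced.
            nn.Conv2d(channels, 2, 1, bias=False),
        )

    def forward(self, q):
        # tanh bounds the raw output; scaling by s constrains the amplitude.
        return self.s * self.net(q).tanh()
```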
(3) Offset Groups
To promote the diversity of deformed points, drawing on the grouping paradigm of Multi-Head Self-Attention (MHSA), the feature channels are divided into G groups. Each group of features independently generates offset vectors through a shared sub-network. In practical deployment, the number of heads (M) of the attention module is set as an integer multiple of the offset groups (G), ensuring that multiple attention heads can collaboratively adapt to each group of deformed keys and values, thereby enhancing the richness of feature representation.
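The grouping can be sketched as follows (a hedged PyTorch example: `offset_net` is assumed to accept $C/G$ channels, and folding groups into the batch dimension is one common way to share the sub-network across groups).

```python
import torch

def grouped_offsets(q, offset_net, G):
    """Fold G channel groups into the batch dimension so a shared sub-network
    produces an independent offset field for each group."""
    B, C, H, W = q.shape
    q_g = q.reshape(B * G, C // G, H, W)   # (B*G, C/G, H, W)
    return offset_net(q_g)                 # (B*G, 2, Hg, Wg): per-group offsets

# With M attention heads and G offset groups, M is chosen as an integer
# multiple of G, so each group's deformed keys/values serve M // G heads.
```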
(4) Deformable Relative Position Bias
To strengthen the spatial information encoding capability, a deformable relative position bias mechanism is introduced: for a feature map of size $H \times W$, the relative coordinate displacements lie in the ranges $[-H, H]$ and $[-W, W]$ along the two axes. Different from the bias table based on discrete displacements in the Swin Transformer, this paper normalizes the relative displacements to the range $[-1, +1]$ and performs interpolation $\phi(\hat{B}; R)$ on a continuous relative bias table $\hat{B} \in \mathbb{R}^{(2H-1) \times (2W-1)}$. This achieves full coverage of all possible offset values and strengthens the modeling of spatial dependencies between queries and keys.
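One way to realize this continuous lookup is with bilinear sampling over the bias table, as in the hedged PyTorch sketch below; the function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def deformable_relative_bias(bias_table, rel_disp):
    """Continuous relative position bias lookup.
    bias_table: (M, 2H-1, 2W-1) learnable table (one slice per head).
    rel_disp:   (Nq, Nk, 2) query-key displacements normalized to [-1, +1]."""
    M = bias_table.shape[0]
    grid = rel_disp.unsqueeze(0).expand(M, -1, -1, -1)     # (M, Nq, Nk, 2)
    # Bilinear interpolation covers all continuous displacements, unlike a
    # Swin-style table indexed only at discrete integer offsets.
    bias = F.grid_sample(bias_table.unsqueeze(1), grid,
                         mode="bilinear", align_corners=True)
    return bias.squeeze(1)                                  # (M, Nq, Nk)
```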
(5) Model Architecture
DAttention adopts a pyramid structure suitable for multi-scale visual tasks, as shown in Figure 2. The input image of size $H \times W \times 3$ undergoes non-overlapping convolutional embedding with a 4 × 4 kernel (stride 4) and normalization, yielding patch embedding features of size $\frac{H}{4} \times \frac{W}{4} \times C$. The backbone network consists of 4 stages, realizing hierarchical feature extraction through progressively increasing stride. Between stages, a non-overlapping 2 × 2 convolution (stride = 2) is adopted for downsampling, which halves the spatial size and doubles the feature dimension, ultimately constructing a complete multi-scale feature pyramid.
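The stem and between-stage downsampling can be sketched as follows (a minimal PyTorch example; the embedding width of 96 and the module names are illustrative assumptions).

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping convolutional embedding: a 4x4 conv with stride 4
    turns an H x W x 3 image into H/4 x W/4 x C patch embeddings."""
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, C, H/4, W/4)
        x = self.norm(x.permute(0, 2, 3, 1))     # LayerNorm over channels
        return x.permute(0, 3, 1, 2)             # back to (B, C, H/4, W/4)

class Downsample(nn.Module):
    """Between-stage non-overlapping 2x2 stride-2 conv: halves the spatial
    size and doubles the feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)

    def forward(self, x):                        # x: (B, C, H, W)
        return self.reduce(x)                    # (B, 2C, H/2, W/2)
```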
3.1.2. SWD-PAN Structure
To address the unidirectional information flow, equal feature weighting, and low efficiency of cross-scale connections in traditional Feature Pyramid Networks (FPN), this paper proposes the SWD-PAN (Slim Weighted Dynamic Path Aggregation Network) bidirectional feature pyramid network. It achieves triple optimizations: bidirectional flow, weighted adaptation, and structural simplification, constructing an efficient feature fusion architecture that enables bidirectional cross-scale feature interaction and adapts to scale differences, thus enhancing target detection accuracy in dense scenarios. Its network architecture is illustrated in Figure 3.
(1) Bidirectional Feature Fusion Mechanism
A top-down and bottom-up bidirectional flow path is established to achieve iterative fusion of high-level semantic features (suitable for large-scale targets) and low-level detailed features (suitable for small-scale targets), which is distinct from the unidirectional FPN that fails to fully integrate cross-scale information.
(2) Learnable Weighted Fusion
Dynamic weights are assigned to features of different scales to highlight the contribution of scale-adaptive features (e.g., higher weights are allocated to small-scale features when passengers are far from the camera), addressing the issue of equal weight distribution in traditional fusion. Its core formula is given in Equation (3):

$$F_{\text{out}} = \sum_{i} w_i \left( \alpha\, F_i^{\text{td}} + (1 - \alpha)\, F_i^{\text{bu}} \right) \tag{3}$$

where $\alpha$ is the bidirectional flow balance coefficient, with a range of [0, 1]; $w_i$ is the scale-adaptive weight; and $F_i^{\text{td}}$ and $F_i^{\text{bu}}$ are the features of the top-down and bottom-up paths at scale $i$, respectively. This formula embodies bidirectional transmission, weighted adaptation, and multi-scale integration, consistent with the structural design of SWD-PAN.
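A minimal PyTorch sketch of one plausible reading of Equation (3) follows; the class name, the sigmoid/ReLU constraints on $\alpha$ and $w_i$, fusing at the finest resolution, and the assumption that all levels share a channel width are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedBiFusion(nn.Module):
    """Sketch of the weighted fusion in Equation (3): learnable non-negative
    per-scale weights w_i blend the top-down and bottom-up paths through the
    balance coefficient alpha, then integrate across scales."""
    def __init__(self, num_scales, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_scales))   # scale-adaptive weights
        self.alpha_raw = nn.Parameter(torch.zeros(()))  # balance coefficient
        self.eps = eps

    def forward(self, feats_td, feats_bu):
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)          # normalize so weights sum to ~1
        a = torch.sigmoid(self.alpha_raw)     # constrain alpha to [0, 1]
        size = feats_td[0].shape[-2:]         # fuse at the finest resolution
        out = 0.0
        for i, (td, bu) in enumerate(zip(feats_td, feats_bu)):
            f = a * td + (1.0 - a) * bu       # bidirectional blending
            out = out + w[i] * F.interpolate(f, size=size, mode="nearest")
        return out
```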
(3) Structural Optimization Strategies
Direct connections are added at the same layer to retain local scale features, and the bidirectional paths are reused for multi-round iteration, further enhancing the network’s adaptability to rapid scale changes.
3.1.3. WIoU Loss Function
The design of the bounding box loss function directly impacts detection box localization accuracy in target detection. This paper adopts the WIoU loss function, whose dynamic non-monotonic focusing mechanism reduces the training weight of extreme samples with large gradients and increases that of medium-quality anchor boxes, avoiding gradient oscillation and accelerating stable convergence. It also dynamically adjusts the gradient gain based on the IoU value, making the loss more sensitive to small positional errors of passenger targets, thereby speeding up bounding box localization convergence and meeting the detection requirements for small targets and passengers in non-standard postures in bus scenarios.
The penalty term formulas of the representative EIoU, Focal-EIoU (F-EIoU), and WIoU are given in Equations (4)–(6):

$$\mathcal{R}_{EIoU} = \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2} \tag{4}$$

$$\mathcal{L}_{F\text{-}EIoU} = IoU^{\gamma}\, \mathcal{L}_{EIoU} \tag{5}$$

$$\mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \tag{6}$$

For a given predicted box $B$ and ground-truth box $B^{gt}$, $b$ and $b^{gt}$ denote the center points of $B$ and $B^{gt}$; $w$ and $w^{gt}$ represent their widths; and $h$ and $h^{gt}$ represent their heights. $\rho(\cdot)$ denotes the Euclidean distance; $\rho(w, w^{gt})$ and $\rho(h, h^{gt})$ are thus the width and height differences between the predicted and ground-truth boxes. $c$ is the diagonal length of the minimum enclosing rectangle (MER) covering both boxes, and $C_w$ and $C_h$ are the width and height of the MER. $\gamma$ denotes the parameter controlling the degree of suppression of outliers. The WIoU loss terms are defined in Equations (7)–(11):

$$\mathcal{L}_{IoU} = 1 - IoU \tag{7}$$

$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\, \mathcal{L}_{IoU} \tag{8}$$

$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty) \tag{9}$$

$$r = \frac{\beta}{\delta\, \alpha^{\beta - \delta}} \tag{10}$$

$$\mathcal{L}_{WIoUv3} = r\, \mathcal{L}_{WIoUv1} \tag{11}$$

Among them, $x_{gt}$ and $y_{gt}$ denote the coordinates of the center point of the ground-truth bounding box; $x$ and $y$ denote the coordinates of the center point of the predicted bounding box; and $W_g$ and $H_g$ are the width and height of the MER. $\beta$ is the outlier degree characterizing anchor box quality, $\overline{\mathcal{L}_{IoU}}$ is the running mean of $\mathcal{L}_{IoU}$, and $\alpha$ and $\delta$ are the hyperparameters of the non-monotonic focusing coefficient $r$. The symbol "*" denotes an operation that separates $W_g$ and $H_g$ (and $\mathcal{L}_{IoU}$ in Equation (9)) from the computation process, i.e., detaches them from the gradient, to prevent $\mathcal{R}_{WIoU}$ from hindering the convergence speed. The advantages and disadvantages of the above loss functions are compared in Table 1.
In bus scenarios, the design of the loss function is crucial to the model’s detection performance owing to the high proportion of small targets and the significant differences in target scales. WIoUv3 adopts a dynamic non-monotonic mechanism to evaluate anchor box quality and designs a reasonable gradient gain allocation strategy, reducing the occurrence of large or harmful gradients from extreme samples. On this basis, WIoUv3 focuses more on anchor boxes of average quality to improve the model’s recognition accuracy and generalization ability.
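To make the mechanism concrete, the following is a minimal PyTorch sketch of WIoUv3 following Equations (6)–(11); the function name, the (x1, y1, x2, y2) box format, and the α = 1.9 / δ = 3 defaults are illustrative assumptions, and the running mean of $\mathcal{L}_{IoU}$ must be maintained by the training loop.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """Sketch of WIoUv3 for (x1, y1, x2, y2) boxes.
    iou_mean: running mean of L_IoU, maintained outside this function."""
    eps = 1e-7
    # IoU and L_IoU = 1 - IoU (Eq. 7).
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    l_iou = 1.0 - inter / (area_p + area_t - inter + eps)

    # R_WIoU (Eq. 6): center distance over the squared MER diagonal;
    # detach() realizes the "*" separation from the computation graph.
    ctr_d2 = (((pred[:, :2] + pred[:, 2:]) -
               (target[:, :2] + target[:, 2:])) ** 2).sum(dim=1) / 4.0
    mer_wh = (torch.max(pred[:, 2:], target[:, 2:]) -
              torch.min(pred[:, :2], target[:, :2]))
    r_wiou = torch.exp(ctr_d2 / ((mer_wh ** 2).sum(dim=1).detach() + eps))

    l_v1 = r_wiou * l_iou                          # Eq. (8): WIoUv1
    beta = l_iou.detach() / (iou_mean + eps)       # Eq. (9): outlier degree
    r = beta / (delta * alpha ** (beta - delta))   # Eq. (10): gradient gain
    return (r * l_v1).mean()                       # Eq. (11): WIoUv3
```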
To verify the adaptability of each loss function, comparative experiments are conducted on a mixed dataset consisting of public datasets and a self-built bus-scenario dataset, from which 10,000 images are randomly selected. The experiments are performed in an environment with an Intel(R) Core(TM) i7-14700HX CPU, an NVIDIA GeForce RTX 4060 GPU, 16 GB of memory, and Python 3.9.19 + PyTorch 2.3.1 (GPU). With 150 training epochs as the standard, the detection performance of different loss functions is compared (as shown in Table 2).
As can be seen from the data in Table 1 and Table 2, the WIoUv3 loss function achieves the optimal performance: its precision reaches 86.3%, 17.41% higher than SIoU and 2.49% higher than GIoU; its mAP50 reaches 74.2%, 4.80% higher than the basic IoU and 0.27% higher than SIoU. Meanwhile, both the recall (63.1%) and frame rate (460 FPS) meet the real-time application requirements of bus scenarios. Its core advantages lie in reducing the gradient interference of extreme samples through the dynamic non-monotonic focusing mechanism and strengthening the training weight of medium-quality anchor boxes, which effectively improves the localization accuracy of detection boxes for passengers in non-standard postures and adapts well to the complex detection needs of bus scenarios.
The architecture of the improved YOLOv8 network is shown in Figure 4. It achieves the improvement in bus passenger object detection performance through the full-link collaborative design of "feature extraction-feature fusion-loss optimization": the DAttention mechanism is embedded in the Backbone and Neck, where it is deeply integrated with the C2f module to construct the DAC2f structure; the Neck adopts the SWD-PAN network to enhance multi-scale feature fusion; and the WIoUv3 loss function is introduced across the entire network to optimize detection box localization.