A Multiscale Parallel Pedestrian Recognition Algorithm Based on YOLOv5

Abstract: Mainstream pedestrian recognition algorithms have problems such as low accuracy and insufficient real-time performance. In this study, we developed an improved pedestrian recognition algorithm named YOLO-MSP (multiscale parallel) based on residual network ideas, and we improved the network architecture based on YOLOv5s. Three pooling layers were used in parallel in the MSP module to output multiscale features and improve the accuracy of the model while ensuring real-time performance. The Swin Transformer module was also introduced into the network, which improved the efficiency of the model in image processing by avoiding global calculations. The CBAM (Convolutional Block Attention Module) attention mechanism was added to the C3 module, and this new module was named the CBAMC3 module, which improved model efficiency while ensuring the model was lightweight. The WMD-IOU (weighted multidimensional IOU) loss function proposed in this study used the shape change between the recognition frame and the real frame as a parameter to calculate the loss of the recognition frame shape, which could guide the model to better learn the shape and size of the target and optimize recognition performance. Comparative experiments using the INRIA public data set showed that the proposed YOLO-MSP algorithm outperformed state-of-the-art pedestrian recognition methods in accuracy and speed.


Introduction
Pedestrian recognition is an important research direction in computer vision and has received widespread attention from researchers. However, the main challenges in this field are insufficient accuracy and real-time performance. There are two main approaches to pedestrian recognition: traditional methods and deep learning. Early traditional methods, such as the HOG+SVM proposed by Dalal et al. [1], focused on extracting pedestrian features through shallow learning techniques such as integrating and aggregating channel features [2,3]. However, traditional methods rely on manually designed feature extractors and have difficulty covering all useful information in the image. They have poor robustness and adaptability in complex and changeable real-world environments. Moreover, their feature extraction is relatively simple, usually involving only low-level image features such as edges, corners, and textures, and lacks the ability to extract higher-level semantic information, which limits their performance in complex scenes. Deep learning automatically learns complex feature representations through multilayer nonlinear transformations, without manual intervention, and extracts rich feature information from large amounts of data. Deep networks can extract features ranging from edges to high-level semantics, greatly improving recognition accuracy and generalization, and they have attracted widespread attention.
Deep learning algorithms for object recognition are generally divided into two categories: two-stage networks and one-stage networks. Examples of two-stage networks are R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6]. They first generate candidate region proposals and then classify these proposals. This operation greatly improves recognition accuracy, but it also makes the computational load of two-stage networks very large, resulting in poor real-time performance, especially when high-resolution images are processed. This shortcoming led to the development of one-stage networks, which simplify the recognition process by eliminating the separate proposal generation step and predicting object classes and bounding boxes directly in a single pass of the network, thus significantly speeding up recognition. One-stage networks such as YOLO [7][8][9][10] and SSD [11][12][13][14][15] directly predict object categories and locations, trading some accuracy for faster processing, which makes them more suitable for real-time applications. However, SSD uses multiscale feature maps to recognize objects of different sizes, which makes it difficult to detect smaller objects on lower-resolution feature maps. In addition, SSD's method of object recognition, which relies on different layers of feature maps, is not as effective as methods such as the feature pyramid network (FPN) used in subsequent versions of YOLO [16], which has led to YOLO entering the mainstream of current single-stage target recognition algorithms [17][18][19][20][21].
With the development of deep learning, new target recognition algorithms based on self-attention mechanisms have emerged that differ from both the two-stage region proposal methods and the one-stage predefined anchor box methods. The introduction of the Transformer architecture in 2017 revolutionized the field of natural language processing (NLP) [22]. Inspired by this, the FAIR team applied the Transformer to object recognition, resulting in the DETR [23] algorithm. Released in 2020, DETR demonstrated competitive performance on the COCO data set. DETR uses the self-attention mechanism of the Transformer architecture to capture the global dependencies between different regions of the image, so that the output at each position takes into account information from all locations in the image, which helps to locate and identify target objects more accurately. However, because self-attention relates every element of the sequence to every other element, the amount of calculation is large, especially when processing high-resolution images or large numbers of images. Moreover, the Transformer architecture captures global context by processing the entire image; this is effective for understanding the whole scene but reduces the focus on smaller objects in the image. Therefore, DETR still faces problems with small targets and recognition speed [24][25][26].
YOLOv8 is the latest work in the YOLO series. It improves on its predecessor YOLOv5 by integrating new features such as a decoupled head, the VFL loss function for classification, and a combination of DFL loss and CIOU loss for regression. The emergence of pure self-attention deep networks [27] marks major progress in the field of object recognition, providing improvements in speed, performance, and versatility, but shortcomings remain in missed targets and real-time performance. Therefore, this paper improves the YOLOv5 model in these two respects and proposes the YOLO-MSP algorithm. Our contributions can be summarized as follows:
1. MSP modules, based on residual networks and multiscale parallel concepts, are fused into the neck region to enhance the model's feature extraction, computational efficiency, and accuracy.
2. Swin Transformer modules are fused into the backbone and neck regions to enhance the model's feature extraction, computational efficiency, and accuracy.
3. The CBAM and C3 modules are integrated into the CBAMC3 module, which replaces the C3 module in the backbone to improve the recognition accuracy of the model.
4. The WMD-IOU loss function is introduced, designed to address the loss effects due to shape dimensions and the equalizing impacts of similar weights.

Network Architecture
YOLOv5 adopts the CSPNet (cross-stage partial network) [28] architecture to improve computing efficiency. CSPNet reduces computational effort by splitting the feature map of a stage into two parts, only one of which passes through the stage's convolutional blocks before the two parts are merged again. The computational complexity is therefore effectively reduced. The YOLOv5s network structure mainly includes four elements: input end, backbone, neck, and prediction end, as shown in Figure 1.

Method
In this section, we introduce the key improvements of the YOLO-MSP model that enhance its performance in object recognition tasks. Firstly, the MSP (multiscale parallel) module is integrated into the neck structure of the network to promote efficient information flow and improve multiscale feature extraction. Secondly, the Swin Transformer module is applied to the backbone and neck parts of the network, and a local window hierarchical attention mechanism is used to improve the accuracy and generalization of the model. Thirdly, the CBAM attention mechanism is integrated with the C3 module in the backbone structure to improve the recognition accuracy and generalization ability of the model. Finally, the WMD-IOU loss function is used to account for the shape proportion loss and achieve more accurate target positioning.

MSP Module
The MSP module is applied to the neck part of the network structure. The specific locations are the dark-purple modules numbered 23, 27, and 31 in Figure 2. These modules are marked with red stars.
The MSP module is designed based on residual network ideas and multiscale parallelism. The module uses the connections between layers to help information move quickly and efficiently through the network. As shown in Figure 3, the CM3 component of the MSP module contains three pooling layers, each with a different output channel. This enhances the extraction of multiscale features while ensuring real-time performance, thereby improving the accuracy of the model. The final adaptive pooling layer helps reduce the loss of input data features and enhances the model's ability to handle complex data and its robustness.
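The parallel-pooling idea can be illustrated with a toy one-dimensional sketch. The kernel sizes (3, 5, 7), the stride of 1, the choice of max pooling, and the helper names `max_pool_same` and `parallel_pool` are all assumptions made for illustration; the CM3 configuration is not specified in the text, so this should not be read as the authors' implementation.

```python
def max_pool_same(signal, kernel):
    """Stride-1 max pooling that keeps the input length (window clipped at the edges)."""
    half = kernel // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def parallel_pool(signal, kernels=(3, 5, 7)):
    """Run several pooling branches side by side and stack them with the input.

    Keeping the untouched input as the first row plays the role of a shortcut
    connection; the remaining rows are views at increasingly coarse scales.
    The kernel sizes are hypothetical.
    """
    branches = [list(signal)]
    for k in kernels:
        branches.append(max_pool_same(signal, k))
    return branches


features = [1, 3, 2, 0, 5, 1]
for row in parallel_pool(features):
    print(row)
```

Because every branch preserves the input length, the rows can be concatenated along a channel axis, which is what lets one module expose several receptive-field sizes at once.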
The graph of the Swish activation function is shown in Figure 4.

Swin Transformer
The Swin Transformer module is applied to the backbone and neck parts of the network structure. The specific locations are the pink modules numbered 10 and 19 in Figure 2, which are marked with a red triangle in front of the modules. Compared with the Transformer structure, the Swin Transformer has a local window hierarchical attention mechanism and uses grouped convolution and cross-layer connections to enhance model performance, accuracy, and generalization capabilities. The Swin Transformer structure is shown in Figure 5. The Swin Transformer module replaces the conventional multihead self-attention (MSA) [27] module with MSA modules based on a window (W-MSA) and a shifted window (SW-MSA). Each is followed by a two-layer perceptron (MLP) with ReLU [30] as the nonlinear function between layers. Layer normalization (LayerNorm) is applied before, and a residual connection after, each MSA module and MLP.
To facilitate self-attention, the feature map $X \in \mathbb{R}^{H \times W \times C}$ undergoes linear projection and reshaping to produce three arrays, $Q$, $K$, and $V$, each in $\mathbb{R}^{N \times C'}$. Here, $\mathbb{R}^{N \times C'}$ denotes the feature space after the transformation, where $N = H \times W$ is the total number of elements in the feature map and $C'$ is the number of features or channels after the transformation. These arrays, as transformed versions of $X$, are the inputs to the self-attention mechanism, the results of which are calculated using Equations (2) and (3) [31][32][33][34].
$$\mathrm{Attention} = TV \qquad (2)$$
The attention matrix $T \in \mathbb{R}^{N \times N}$ characterizes the interrelationships among feature map elements, accumulating global information in the attention outputs. Absolute position encoding adds learnable parameters to each module before the self-attention calculation. In contrast, relative position encoding adds parameters that are relative to the computation. The relative position offset is defined for each axis. Optimization is achieved using the W-MSA module. In Vision Transformers, MSA allows self-attention calculations across all pixels, whereas W-MSA restricts these calculations to adjacent pixels within 7 × 7 windows. The computational complexity Ω of the attention calculation with local windows of size $m \times m$ on the feature map $X \in \mathbb{R}^{H \times W \times C}$ is determined by Equations (4) and (5).
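Two points from this subsection can be made concrete in a short sketch: the attention output $TV$ and the cost gap between global MSA and W-MSA. The scaled dot-product softmax used for $T$ is an assumption (the standard Transformer form), and the relative position term is omitted. The two complexity expressions are the ones published in the original Swin Transformer paper, written here with the window size `m`; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Return T @ V, assuming T = softmax(Q K^T / sqrt(d)) with shape N x N."""
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    t = [softmax(row) for row in scores]
    return [[sum(t_ij * vj[c] for t_ij, vj in zip(t_i, v)) for c in range(len(v[0]))] for t_i in t]


def msa_flops(h, w, c):
    """Global MSA cost: 4hwC^2 + 2(hw)^2 C, quadratic in the number of pixels."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def wmsa_flops(h, w, c, m=7):
    """Window MSA cost with m x m windows: 4hwC^2 + 2 m^2 hw C, linear in hw."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Illustrative sizes only: a 56 x 56 feature map with 96 channels.
ratio = msa_flops(56, 56, 96) / wmsa_flops(56, 56, 96)
print(f"global MSA costs about {ratio:.1f}x more than W-MSA")
```

When the window covers the whole feature map the two costs coincide, which is a quick sanity check on the formulas.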

CBAMC3 Module
The CBAMC3 module is applied to the backbone part of the network structure. Its locations are the light-green modules numbered 2, 4, 6, 8, and 12 in Figure 2. The front of these modules is marked with a red circle.
The CBAM attention mechanism consists of two submodules: a channel attention module and a spatial attention module. The channel attention module focuses on extracting important information; it compresses the spatial dimensions of the feature map while keeping the number of channels unchanged. This enables the model to prioritize the most informative features, effectively identifying parts of objects that are still visible despite occlusions. The spatial attention module specializes in pinpointing the precise location of objects by analyzing the spatial distribution of features without changing the number of channels. This targeted approach helps the model focus on areas where occluding objects may be present, thereby improving recognition accuracy. Integrating CBAM with the C3 module significantly improves the model's ability to discern important features under different conditions without significantly increasing the total number of parameters. The cooperation between channel and spatial attention enables the model to retain critical information, thereby improving its robustness and accuracy in scenes where objects are partially hidden. The CBAM module is shown in Figure 6, and the CBAMC3 module is shown in Figure 7.
The CBAM module takes an input feature map $F$ of size $H \times W \times C$. In the channel attention mechanism, $F$ is processed with maximum and average pooling to produce two feature maps of size $1 \times 1 \times C$. These two feature maps are passed through an MLP and a sigmoid activation function to obtain a one-dimensional channel attention map $M_c$. Multiplying $M_c$ with the original feature map $F$ yields the channel-attention-adjusted feature map $F_1$. In the spatial attention mechanism, $F_1$ is subjected to maximum and average pooling again, this time along the channel axis, resulting in two feature maps of size $H \times W \times 1$. The two feature maps are concatenated and then convolved to generate a two-dimensional spatial attention map $M_s$. Finally, multiplying $M_s$ with $F_1$ yields the output feature map of the CBAM module. This design can simultaneously focus on features in different channels and locations, thus improving the model's accuracy.
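The two attention steps can be traced in a minimal pure-Python sketch on a tiny feature map stored as `fmap[channel][row][col]`. Several details are assumptions made for brevity: the shared MLP has one hidden unit with a ReLU, its two pooled outputs are summed before the sigmoid (as in the original CBAM paper), and a 1 × 1 kernel with hypothetical weights `w_avg` and `w_max` stands in for the larger convolution normally used in the spatial branch. It is not the CBAMC3 module itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap, w1, w2):
    """Mc = sigmoid(MLP(avg_pool(F)) + MLP(max_pool(F))): one weight per channel."""
    flat = [[v for row in ch for v in row] for ch in fmap]
    avg = [sum(ch) / len(ch) for ch in flat]
    mx = [max(ch) for ch in flat]

    def mlp(vec):
        # Shared two-layer perceptron: C -> C/r (ReLU) -> C.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, w_avg, w_max):
    """Ms = sigmoid(conv([avg; max] over channels)); a 1 x 1 kernel stands in for the conv."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            column = [fmap[c][i][j] for c in range(channels)]
            row.append(sigmoid(w_avg * sum(column) / channels + w_max * max(column)))
        out.append(row)
    return out


def cbam(fmap, w1, w2, w_avg=1.0, w_max=1.0):
    """Channel attention first, then spatial attention; the output keeps the input shape."""
    mc = channel_attention(fmap, w1, w2)
    f1 = [[[v * mc[c] for v in row] for row in ch] for c, ch in enumerate(fmap)]
    ms = spatial_attention(f1, w_avg, w_max)
    return [[[v * ms[i][j] for j, v in enumerate(row)] for i, row in enumerate(ch)] for ch in f1]


# Two channels of size 2 x 2; illustrative weights with one hidden unit.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]]
w1 = [[0.5, 0.5]]      # C -> C/r
w2 = [[1.0], [1.0]]    # C/r -> C
refined = cbam(feature_map, w1, w2)
print(refined)
```

Both attention maps lie in (0, 1), so the module rescales features rather than replacing them, and the output can be dropped into the position of the original feature map.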

WMD-IOU Loss Function
The initial loss function used by the YOLOv5 model is the CIOU loss function. The CIOU loss for measuring similarity between two arbitrary shapes $C, C_i \in S \in \mathbb{R}^n$ is calculated as follows:
The aspect ratio measurement in CIOU [35] is complex, leading to slow convergence. Yi-Fan Zhang and colleagues decompose the aspect ratio loss into the differences between the predicted and target widths and heights, normalized by the width and height of the minimum enclosing box, thereby speeding up convergence and improving regression accuracy. In addition, they introduced focal loss to solve the sample imbalance problem in the bounding box regression task. This approach reduces the optimization contribution of low-quality anchor boxes that overlap least with the target box, focusing the regression process on high-quality anchor boxes. The formula of the EIOU [36] loss function is as follows:
Although EIOU considers the overlap loss, center distance loss, and aspect ratio loss between frames, it does not fully account for changes in the shape ratio of the target box. To overcome this problem, YOLO-MSP introduces WMD-IOU (weighted multidimensional IOU loss function), which uses the shape ratio of the target and anchor boxes as a weighting factor to enhance bounding box regression in object recognition. The difference in shape ratio thus directly affects the size of the loss value and guides the model to better match the shape proportions. Additionally, it allows the impact of each loss factor to be adjusted individually for different data sets. The formula of the WMD-IOU loss function is as follows:
$$L_{alliou} = IOU^{\gamma} L_{iou} \qquad (12)$$
where $\alpha$, $\beta$, $\Delta$, $\lambda$, and $\mu$ are weight influencing factors, tailored according to the characteristics of the data set, with a weight range of (0, 1). $\Delta w$ and $\Delta h$ represent the shape ratios between the predicted box and the anchor box, addressing losses attributed to shape differences and enhancing loss diversity. In this formula, $L_{alliou}$ and $L_{iou}$ represent the final and optimized loss functions, respectively. The additional $\gamma$ parameter is introduced to address the issue of sample imbalance. By adjusting $\gamma$, loss contributions from specific samples can be amplified or reduced. This allows $\gamma$ to be customized based on sample characteristics, making the loss function target key samples more effectively. After integration with focal loss, the final framework addresses the impact of sample imbalance by considering the loss contributions of challenging samples.
The summation $\sum_{\omega_t = \Delta w, \Delta h} (1 - e^{\omega_t})$ in the formula ensures careful consideration of differences in width and height. The exponential function is used because, in $1 - e^{\omega_t}$, as $\omega_t$ increases, $e^{\omega_t}$ rapidly approaches or exceeds 1, which decreases the value of the entire expression and thus reduces the contribution of this part of the loss. This design means that when the difference in shape between the predicted box and the initial box is small, the loss increases, thereby encouraging the model to better learn to minimize such discrepancies. Additionally, it allows for a smoother adjustment of the loss function's sensitivity to shape differences, avoiding excessive perturbation of other aspects. The explanatory diagram for WMD-IOU is shown in Figure 8. In summary, the WMD-IOU loss function combines comprehensive error measurement and enhanced shape sensitivity with adjustable weight coefficients. It can enable more precise target localization and increased adaptability to variations in target size and shape.
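As a point of reference for the loss terms discussed above, the following sketch computes IoU, the EIOU loss in the form published by Zhang et al. [36], and the $IOU^{\gamma}$ weighting pattern of Equation (12). It does not implement WMD-IOU, whose weighted shape terms are not sketched here; applying the weighting to EIOU, the value `gamma=0.5`, the `(x1, y1, x2, y2)` box format, and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def eiou_loss(pred, target):
    """EIOU: 1 - IoU + center-distance term + width term + height term."""
    # Width and height of the smallest box enclosing both boxes.
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    # Squared distance between the two box centers, normalized by the diagonal.
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    center = (dx**2 + dy**2) / (cw**2 + ch**2)
    # Width and height differences, normalized by the enclosing box.
    dw = (pred[2] - pred[0]) - (target[2] - target[0])
    dh = (pred[3] - pred[1]) - (target[3] - target[1])
    return 1.0 - iou(pred, target) + center + dw**2 / cw**2 + dh**2 / ch**2


def focal_weighted(loss, iou_value, gamma=0.5):
    """Equation (12) pattern: scale a box loss by IoU ** gamma (gamma is illustrative)."""
    return iou_value**gamma * loss


pred, target = (0, 0, 2, 2), (1, 1, 3, 3)
print(focal_weighted(eiou_loss(pred, target), iou(pred, target)))
```

With this weighting, boxes that barely overlap the target contribute less to the total, which is the sample-imbalance behavior the text attributes to $\gamma$.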

Experiment Environment
In this experiment, the software environment was PyTorch 1.12.0, and the hardware included an Intel(R) Core(TM) i7-11700K and an NVIDIA GeForce RTX 3080Ti, with 16 GB of memory and the Ubuntu operating system installed, as shown in Table 1. The training parameters included a weight decay coefficient of 0, a batch size of 16, and 300 epochs. To avoid overfitting, label smoothing was applied to the image classification labels with a smoothing coefficient of 0.01.
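For readers unfamiliar with label smoothing, a minimal sketch with the coefficient of 0.01 follows. The uniform-mixing form shown here is the common textbook variant and is an assumption; the text does not state which variant the training code applies, and `smooth_labels` is a hypothetical helper.

```python
def smooth_labels(one_hot, epsilon=0.01):
    """Blend a one-hot target with a uniform distribution over the classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


# Two-class example: the hard targets 0 and 1 are pulled slightly toward each other.
print(smooth_labels([0.0, 1.0]))
```

The smoothed targets still sum to 1, so they remain a valid distribution while discouraging the network from becoming overconfident.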

Dataset Details
This study uses the INRIA public data set [37] collected by Navneet Dalal. The INRIA data set is a widely used collection of labeled images of standing or walking pedestrians, designed for identifying upright pedestrians in pictures and videos. It includes a training set consisting of 614 positive samples (1237 pedestrians) and 1218 negative samples, and a test set consisting of 288 positive samples (589 pedestrians) and 453 negative samples. The data set features complex backgrounds, diverse human postures, and diverse environments. To improve model performance, 200 simulated pedestrian samples were selected from the FOG part of the CrowdSim2 [38] data set and added to the INRIA data set. The data were divided into training, validation, and test sets at a ratio of 6:2:2.
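A 6:2:2 partition of this kind can be sketched as follows. The shuffling, the fixed seed, and the helper name `split_dataset` are assumptions; the text does not describe how samples were assigned to each partition.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the test set takes any remainder
    )


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Integer ratios are used instead of fractions so that the partition sizes do not depend on floating-point rounding.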

Evaluation Metrics
This experiment uses evaluation indicators commonly used in deep learning object recognition: precision (P), recall (R), average precision (AP), and mean average precision (mAP). P and R are defined in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The P-R curve is drawn from precision P and recall R, and the AP value is calculated as the area under this curve. The mAP value is the average of the AP values over all N categories. The F1 value is calculated from precision P and recall R.
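These indicators follow their standard definitions, which can be written out as a short sketch. The trapezoidal rule used for the area under the P-R curve is an assumption chosen for simplicity; detection toolkits often use an interpolated variant, and the function names are hypothetical.

```python
def precision(tp, fp):
    """P = TP / (TP + FP): the fraction of predicted pedestrians that are real."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN): the fraction of real pedestrians that are found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """F1 = 2PR / (P + R): the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def average_precision(recalls, precisions):
    """Area under the P-R curve by the trapezoidal rule; recalls must be ascending."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """mAP = the mean of the AP values over all N categories."""
    return sum(ap_per_class) / len(ap_per_class)


p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
print(p, r, f1_score(p, r))
```

For a single-class task such as pedestrian recognition, N = 1 and the mAP equals the AP of that class.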

Experimental Results
This study uses the YOLO-MSP model to conduct experiments on the INRIA public data set and compares the results with those of other representative models. As shown in Table 1, YOLO-MSP outperforms its peers, with a maximum mean average precision (mAP) of 96% and an F1 value of 91%. Compared with YOLOv8s, YOLO-MSP's mAP is 0.8% higher, its inference time is 0.4 ms shorter, and its recognition frame rate is 1 FPS higher. Notably, YOLO-MSP shows superior performance in identifying unlabeled small objects in raw images, with smaller center point and size deviations than other models. In the table, best and worst results are displayed in red and green, respectively.
We plotted the precision-recall (P-R) curves of YOLOv8s and YOLO-MSP and the comparison curves of mAP@0.5 and mAP@0.5:0.95. These plots show how the various indicators of the model gradually stabilize and improve as training proceeds. The precision-recall (P-R) improvement of YOLO-MSP is more significant, and its mAP value is higher than that of YOLOv8s. The left side of each picture shows the YOLOv8s model, and the right side shows the YOLO-MSP model. The images are shown in Figure 9 and Figure 10, respectively.
We spliced together the recognition frame results of different algorithms on the same picture from the INRIA public data set, as shown in Figure 11. Compared with the other algorithms, YOLO-MSP not only has a high intersection over union (IoU) but can also recognize small targets that are not labeled in the original data set. The overall mAP@0.5 value of the recognition frame and the inference time per image (ms) of different algorithms on the same image are shown in Table 2. In the table, best and worst results are displayed in red and green, respectively.

Low-Brightness Experiments
The YOLO algorithm has a poor target recognition rate under low-brightness conditions. Therefore, this study used some data from the KAIST [39] data set to conduct low-brightness experiments and compare the recognition performance of the improved model with that of other models. The INRIA data set was used to train all models, and evaluation was performed every 10 training epochs. A total of 300 epochs were trained in the experiment. The recognition results show that this model has a better recognition rate than other models under low-brightness conditions, and the mAP value can meet the needs of pedestrian recognition at night. The experimental results are shown in Figure 12.

Ablation Experiments
Ablation experiments were conducted in this study to compare the performance difference between the improved model and other models. Label smoothing was applied during training, and evaluation was conducted every ten training epochs. A total of 300 epochs were trained in the experiment, and the results are presented in Table 3. In the table, best and worst results are displayed in red and green, respectively.

Hyperparameter Tuning
Manual tuning was performed to obtain the best hyperparameters for the YOLO-MSP model and to observe how different hyperparameter settings affect its recognition performance on a self-made data set. The experiments showed that the model achieved the best performance when using a batch size of 16 and an SGD optimizer and when scaling the image size to 1280 × 1280. We believe that an image size of 1280 better preserves the feature information of small objects in the images, thus improving recognition accuracy. The experimental results are presented in Table 4. In the table, best and worst results are displayed in red and green, respectively.

Crossover Experiments
In this study, to evaluate the model's robustness, we divided the entire data set into ten mutually exclusive subsets. Then, we adopted the five-fold cross-validation method: two subsets were selected as the test set each time, and the remaining eight subsets were used as the training set. In this way, we observed the model's performance changes under different training sets.
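The fold construction described above can be sketched as a small generator. Pairing consecutive subsets into each test fold is an assumption; the text does not say how the two test subsets were chosen, and `five_fold_splits` is a hypothetical name.

```python
def five_fold_splits(n_subsets=10, per_fold=2):
    """Yield (train_ids, test_ids) pairs so that each subset is tested exactly once."""
    ids = list(range(n_subsets))
    for start in range(0, n_subsets, per_fold):
        test_ids = ids[start:start + per_fold]
        train_ids = [i for i in ids if i not in test_ids]
        yield train_ids, test_ids


for train_ids, test_ids in five_fold_splits():
    print("test:", test_ids, "train:", train_ids)
```

Each of the five folds trains on eight subsets and tests on the remaining two, so the reported spread in mAP and F1 reflects sensitivity to the choice of training data.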
Experimental results show that the model's mean average precision (mAP) on the different data splits reaches a maximum of 0.960 and a minimum of 0.947; the highest F1 score is 0.91, and the lowest is 0.89. These results show that despite changes in the training data, the model's performance remains at a high level, indicating that the model is robust. In addition, the experiments show that the model's real-time performance remains unchanged across the different splits, further verifying the stability of the model. The experimental results are presented in Table 5.

Conclusions
The YOLO-MSP algorithm proposed in this study uses the MSP module, based on residual networks and multiscale parallel concepts, to enhance pedestrian feature recognition. The local attention mechanism of the Swin Transformer reduces the amount of calculation and improves the real-time performance of the algorithm. The CBAMC3 module is formed by merging the CBAM and C3 modules to improve the recognition accuracy of the algorithm. The WMD-IOU loss function takes the shape difference loss into account, thereby improving the recognition accuracy of the target. However, because the algorithm trades some precision for a higher recall rate, the F1 value cannot fully reflect the overall advantages of the model. The YOLO-MSP algorithm is well suited to application scenarios with heavy pedestrian traffic that require strong real-time detection capabilities. This is of particular value for autonomous driving systems, which require accurate and timely recognition of pedestrians. In the future, our research will combine more advanced attention mechanisms and loss functions to further improve detection accuracy without affecting the real-time performance of the algorithm, focusing on fine-tuning the balance between precision and recall to improve the F1 score. In addition, we plan to apply the algorithm to various environments and test its effectiveness under different weather and lighting conditions, which is crucial for its application in real scenarios.

Figure 1. The original structure diagram of the YOLOv5s model.
We incorporated many improvements into the YOLOv5 model architecture. Inspired by the concept of residual networks, the MSP module is introduced in the model neck to increase extractable features and reduce training errors, thereby improving accuracy. The Swin Transformer modules have been integrated into the backbone and neck of the model. In addition, the CBAM attention mechanism and the C3 module are merged to form the CBAMC3 module, which replaces the original C3 module in the backbone, retains the lightweight design of the model, and improves model recognition accuracy. Figure 2 shows the updated YOLO-MSP model architecture.

Figure 3. MSP module structure diagram.
The MSP module uses the Swish [29] activation function, which is continuously differentiable and smoother than the ReLU activation function. It helps to improve the smoothness of network model training. A smoother activation function generally allows faster convergence and a higher learning rate. The formula of the Swish activation function is given in Equation (1).
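Assuming the standard definition of Swish, $f(x) = x \cdot \mathrm{sigmoid}(\beta x)$, the contrast with ReLU can be seen in a few lines. Setting $\beta = 1$ is an assumption made here for illustration; the value used in the MSP module is not stated.

```python
import math


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x); beta = 1 is an illustrative default."""
    return x / (1.0 + math.exp(-beta * x))


def relu(x):
    """ReLU for comparison: zero for negative inputs, with a kink at the origin."""
    return max(0.0, x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, Swish passes small negative values and has no kink at zero, which is the smoothness property the text refers to.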

Figure 5. Swin Transformer structure diagram.
The input image is initially divided into patches by the patch partition module, with each patch containing 4 × 4 adjacent pixels, then flattened along the channel dimension, altering the image shape from [H, W, 3] to [H/4, W/4, 48]. A linear embedding layer is applied to modify the channel dimension from 48 to C, transforming the image shape to [H/4, W/4, C]. Subsequently, four stages with varying feature map sizes are created.
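The shape bookkeeping of the patch partition and linear embedding steps can be checked with a short sketch. The 224 × 224 input and the embedding width of 96 are illustrative values, not settings taken from the text, and both function names are hypothetical.

```python
def patch_partition_shape(height, width, channels=3, patch=4):
    """Shape after grouping each patch x patch block of pixels into one token."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    # Each token holds patch * patch pixels with `channels` values per pixel.
    return height // patch, width // patch, patch * patch * channels


def linear_embedding_shape(shape, embed_dim):
    """A per-token linear layer changes only the channel dimension."""
    h, w, _ = shape
    return h, w, embed_dim


tokens = patch_partition_shape(224, 224)       # [H/4, W/4, 48]
embedded = linear_embedding_shape(tokens, 96)  # [H/4, W/4, C]
print(tokens, embedded)
```

The 48 in the text is simply 4 × 4 pixels times 3 color channels.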

Figure 8. WMD-IOU loss function boundary regression diagram.The yellow area represents the overall image size, the red line area represents the real box size, the green line area represents the predicted box size, and the blue line represents the offset between the predicted box and the real box.

Figure 9. Training indicator trend chart of YOLOv8s and YOLO-MSP. The horizontal axis represents the number of training iterations, and the vertical axis represents the corresponding indicator value. The figure from left to right shows average precision (mAP@0.5 and mAP@0.5:0.95) and recall rate (precision/recall).

Figure 11. The recognition frame results of different algorithms on the INRIA public data set; from left to right are the real frame, YOLOv5s, YOLOv7-tiny, DETR, YOLOv8s, and YOLO-MSP.

Figure 12. The recognition frame results of different algorithms on the KAIST public data set; from left to right are YOLOv5s, YOLOv7-tiny, DETR, YOLOv8s, and YOLO-MSP.

Table 2. Comparative analysis of object recognition algorithms.

Table 4. Experimental results of YOLO-MSP under different hyperparameters.

Table 5. Experimental results of YOLO-MSP under different crossover settings.