Residual Transformer YOLO for Detecting Multi-Scale Crowded Pedestrian

Abstract: Crowding and occlusion pose significant challenges for pedestrian detection and can easily lead to missed and false detections of small-scale and occluded pedestrians in dense scenes. To improve dense pedestrian detection accuracy, we propose the Residual Transformer YOLO (RT-YOLO) algorithm. RT-YOLO enhances the multi-scale fusion strategy of YOLOv7 and introduces a dedicated detection layer for small-scale occluded targets. It also integrates ResNet and Transformer structures to improve the small-scale feature layer and detection head, enhancing feature extraction capabilities. Additionally, RT-YOLO incorporates the Normalization-based Attention Module (NAM) into the backbone and neck networks to identify the region of interest. Experiments on the CrowdHuman and WiderPerson datasets show that, at IoU (Intersection over Union) = 0.5, mAP50 improves by 3.8% and 3.4%, respectively. Over the IoU range from 0.5 to 0.95, mAP50:95 improves by 5.1% and 4%, respectively. RT-YOLO achieves an FPS of 67, maintaining real-time performance. On the VOC2007 dataset, mAP50 improves by 5.1%, indicating higher effectiveness and robustness.


Introduction
Object detection is one of the main research directions in computer vision, where the main task is to recognize specific classes and precise coordinates of objects in images on demand [1]. With the rapid development of autonomous driving, video surveillance, and other fields, human-centered object detection has received a great deal of attention. In pedestrian detection tasks, crowded and occluded scenarios pose significant challenges. Due to factors such as the angle and distance of the camera, pedestrian targets vary in scale and can overlap with each other, resulting in a limited number of effective feature pixels within the detection area. This makes it difficult to distinguish boundaries, and in such cases the detector may mistakenly identify multiple pedestrian targets as a single target or suffer from detection box drift. These challenges in crowded pedestrian detection are also commonly encountered in numerous contemporary object detection tasks [2].
Pedestrian detection algorithms can be categorized into two types: traditional manual feature detection and deep learning detection algorithms. Traditional detection algorithms depend on manually designed features for object characterization. These features include Haar wavelet features [3], which combine human movement and appearance; HOG features, which describe pedestrian contours using edge direction information [4]; LBP features with grayscale and rotation invariance [5]; and structural SIFT features with scale invariance [5]. Traditional manual feature detection methods have greatly contributed to the development of pedestrian detection research. However, such algorithms exhibit limited robustness in complex scenarios. In recent years, due to the rapid development of deep learning, the research focus in pedestrian detection has shifted from manual features to deep learning detection algorithms.
Among deep learning-based pedestrian detection algorithms, X. Liang et al. improved Fast R-CNN [6] for pedestrian detection by designing two sub-networks for large-scale and small-scale pedestrians [7], and enhanced the efficiency of pedestrian detection by using candidate regions extracted by the ACF detector [8]. L. Zhang et al. analyze the performance of Faster R-CNN [9] in pedestrian detection and use a mixed strategy that classifies the candidate regions extracted by RPN with a random forest [10]. Y. Tian et al. aim to solve the occlusion issues in crowded pedestrian detection and propose a part-based detection scheme named DeepParts [11]. M. Hong et al. propose SSPNet [12], which suppresses interference from complex backgrounds using features at different scales and preserves valuable pedestrian information by sharing features. S. Huang et al. [13] propose a feature-aligned pyramid network that improves the efficiency of dense pedestrian detection by aligning up-sampled features with local features via pixel offsets. These algorithms perform well in pedestrian detection, but their feature extraction capability for small-scale and occluded pedestrians still needs to be enhanced, which leaves them prone to missed and false detections.
To address false and missed detections in dense pedestrian detection, we propose Residual Transformer YOLO (RT-YOLO) based on YOLOv7 [14], and verify its performance through ablation, comparison, and generalization experiments. The main contributions of this paper can be summarized as follows:
• We propose a new detection head for small-scale, occluded objects named Bottleneck Transformer Detect Head (BTDetect), which receives the two largest-scale feature maps and fully extracts detail information to achieve accurate localization of small-scale, occluded objects;
• Based on the ResNet Bottleneck structure, we propose the Bottleneck Transformer Encoder Layer (BOTrans), which enhances the prediction potential of the model using the Self-Attention of the Transformer;
• We combine convolutional neural networks with the Transformer structure to enhance detection performance while reducing the gradient dispersion and feature loss caused by deepening the network hierarchy;
• We propose a new multi-scale fusion strategy. It includes dedicated feature network layers designed for occluded and small-scale objects, increases the receptive field of the network, efficiently fuses low-level detail information with high-level semantic information, and improves the generalization ability of the model;
• By incorporating the Normalization-based Attention Module (NAM attention), the model focuses on the areas of interest from both spatial and channel perspectives. This effectively utilizes network parameters and reduces interference from background noise.

Related Work
YOLOv7 [14] is the most representative model in the YOLO series [14][15][16][17][18][19] at present, and many subsequent versions of YOLO have also drawn inspiration from its structure. Within the range of 5 FPS to 160 FPS, YOLOv7 outperforms most known object detectors in terms of both speed and accuracy. Its main structure is divided into Backbone, Neck network, and YOLO detection head, as shown in Figure 1.
The Backbone is the initial feature extraction network of YOLOv7. When a 640 × 640 image is input into the Backbone, a set of extracted feature points is obtained in the form of feature maps. Efficient Layer Aggregation Networks (ELAN) [14] allow the deep network to extract features efficiently and converge reasonably by controlling the shortest and longest gradient paths, as shown in Figure 2a. Finally, the Backbone outputs feature maps at three scales: 80 × 80, 40 × 40, and 20 × 20.
The Neck is the enhanced feature extraction network of YOLOv7. The three feature maps obtained from the Backbone are fused by up-sampling in the Neck. Subsequently, the fused features are down-sampled. Here the Extended Efficient Layer Aggregation Network (E-ELAN), shown in Figure 2b, extends the channels and cardinality of ELAN by introducing group convolution. It applies the same group parameter and channel multiplier to all blocks of the computational layer while keeping the number of channels the same as in ELAN, so parameter utilization and computational efficiency improve further. The Neck finally outputs three enhanced effective feature maps of sizes 80 × 80, 40 × 40, and 20 × 20.
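The relation between the 640 × 640 input and the three output scales can be checked with a few lines of arithmetic. The sketch below is only an illustration: the down-sampling strides 8, 16, and 32 are inferred from the sizes quoted above (640/80, 640/40, 640/20), and the function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, strides):
    """Return the side length of the square feature map produced at each stride."""
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(
                f"input size {input_size} is not divisible by stride {stride}"
            )
        sizes.append(input_size // stride)
    return sizes


# Strides are an assumption inferred from 640 -> 80, 40, 20 in the text.
print(feature_map_sizes(640, (8, 16, 32)))  # [80, 40, 20]
```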
The YOLO Head is the classifier and regressor in YOLOv7. It includes RepConv [20] reparameterization to fuse the multi-branch training weights and enhance detection speed [21].
The three feature maps obtained from the Neck each have a width, a height, and a number of channels, effectively forming a set of all feature points. The YOLO Head assesses whether objects are present at the feature points corresponding to the anchor boxes and then outputs the class and location information of these objects.
YOLOv7 possesses a deep network architecture, and it performs excellently in most application scenarios. However, as the network depth increases, it may not extract features from occluded or densely packed objects adequately, potentially leading to the loss of fine-grained details. These issues are also reflected in the context of crowded pedestrian detection.

The Proposed Method
In this paper, we propose the RT-YOLO algorithm, which includes a small-scale occlusion object prediction head, integrates Transformer, and utilizes NAM attention to improve the detection performance of the algorithm for objects of varying scales and occlusion in crowded pedestrian scenarios.

Residual Transformer YOLO
The structure of the RT-YOLO algorithm is shown in Figure 3 and consists of three components. The first part is the backbone feature extraction network. After the initial feature extraction from the input image, it generates four feature maps at layers 4, 6, 9, and 12. These feature maps then undergo channel adjustment through convolutional groups (Conv) and are connected to the Neck for down-sampling, thereby achieving feature fusion. The second part is the Neck, which enhances feature extraction by merging shallow and deep features, with the aim of combining information from different scales. The feature maps from the Backbone are stacked with the Neck's main path for up-sampling to achieve feature fusion, and the resulting feature maps are down-sampled after BOTrans and ELAN to achieve feature fusion once more. Finally, the obtained feature maps are sent to the YOLO Head for decoupling.
The third part is the YOLO Head, which has four detection heads, including two small-scale, occluded object prediction heads named BTDetect. These heads receive the two feature maps containing the richest detailed information. Afterward, a 1 × 1 Conv layer is applied to decouple the output results into object location and category information. In addition, introducing NAM attention before the residual branches can reduce the interference of irrelevant features and enhance the convergence efficiency of the algorithm without increasing computational and spatial complexity.

Bottleneck Transformer Encoder Layer and Detect Head
Transformer [22] signifies a pivotal advancement in deep learning, originally designed to enhance NLP performance. Its remarkable capabilities inspired DETR [23] to integrate it into object detection. Transformer exhibits superior performance and potential in various vision tasks compared to traditional convolutional networks. However, Transformer comes with high computational and spatial complexity due to its self-attention computation over global features, which is directly related to the number of feature pixels. To address this issue, combining Transformer with ResNet [24] reduces computational and spatial complexity, resulting in a more potent visual model compared to conventional convolutional network structures. BOTrans uses the global Multi-Head Self-Attention (MHSA) of the Transformer Encoder to replace the Conv layer of ResNet, reducing parameters and lowering latency. The structure of BOTrans is shown in Figure 4b.
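To make the dependence on the number of feature pixels concrete, the following sketch counts the attention weights that one head of global self-attention must compute when every pixel of an H × W feature map is treated as a token. It is a back-of-the-envelope illustration rather than a measurement of RT-YOLO, and the function name `attention_matrix_entries` is hypothetical.

```python
def attention_matrix_entries(height, width):
    """Return (tokens, pairwise attention weights) for one global attention head."""
    tokens = height * width          # every feature pixel becomes one token
    return tokens, tokens * tokens   # each token attends to every token


for side in (20, 40, 80):
    tokens, entries = attention_matrix_entries(side, side)
    print(f"{side}x{side}: {tokens} tokens, {entries:,} attention weights per head")
```

Doubling the side length of the feature map multiplies the number of attention weights by 16, which is why global self-attention is usually applied selectively rather than at every layer.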
ResNet typically has 4 stages with strides (4, 8, 16, 32), and BOTrans is derived by replacing the 3 groups of 3 × 3 Conv layers in stage 4 with MHSA, as detailed in Table 1. This substitution reduces the BOTrans parameter count by 18% compared to ResNet at a 1024 × 1024 input size. BOTrans maintains an identical structure to ResNet, allowing seamless integration into object detection models and enhancing their feature extraction capabilities. In RT-YOLO, BOTrans attends to various regions within the image and acquires the relationships and contexts between these regions. This mechanism helps capture the global context and semantic information of objects, leading to more accurate differentiation of occluded object boundaries and more precise localization of small-scale objects.
Figure 5 shows the structure of MHSA, which differs from the canonical MHSA in three main aspects. Firstly, the number of heads is reduced from 8 to 4. Secondly, the content-position encoding $r$ is two-dimensional instead of one-dimensional, with $R_h$ and $R_w$ representing relative information in the vertical and horizontal directions. Thirdly, position encoding is embedded after the MHSA layer. MHSA multiplies the input features $x$ with the learnable parameter matrices $W_Q$, $W_K$, and $W_V$ to obtain the corresponding query matrix $q$, matching matrix $k$, and information matrix $v$. In addition, relative position encoding $r$ is used in Self-Attention. By multiplying $r^{T}$ with $q$, position-related information is obtained, thereby integrating position awareness into the attention calculation. This enables the model to consider spatial relationships between features and their positions, ultimately enhancing feature extraction efficiency and robustness. Subsequently, $q$ is multiplied with $k^{T}$ and then summed with the position term. After softmax, this yields the weights for $v$. These weights are used to calculate the weighted relevance scores, which are the results of the self-attention computation. MHSA computes its heads in parallel, exchanges information through the shared parameter $W$, and combines the features learned by the different attention heads to obtain the result. The multi-head computation can be calculated as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\,W \tag{1}$$

where

$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i \tag{2}$$

Here, Equation (1) represents the result after multi-head self-attention computation, while Equation (2) represents the result of a single head's computation, where $i$ is the head number. $Q$, $K$, and $V$ are the query, matching, and information matrices generated from the input image features, that is, the sets of $q$, $k$, and $v$ mentioned above. $d_k$ represents the length of vector $k$. Dividing by $\sqrt{d_k}$ prevents the values from becoming too large after multiplying the $Q$ and $K$ matrices, which could result in small gradients after softmax. $T$ denotes the matrix transpose. $W$ is the learnable shared parameter, and $n$ is the number of attention heads; in BOTrans, $n = 4$.
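A single attention head with an additive relative-position term can be sketched in plain Python as follows. This is a simplified illustration, not the BOTrans implementation: it uses a one-dimensional token sequence instead of the two-dimensional $R_h$ and $R_w$ encodings, it assumes the $\sqrt{d_k}$ scaling is applied to the sum of the content and position logits, and it omits the output projection $W$ and the concatenation over heads. The names `attention_head` and `rel` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_head(q, k, v, rel=None):
    """Scaled dot-product attention for one head.

    q, k, v: lists of token vectors (one vector per token).
    rel: optional table of 2n-1 position vectors indexed by the relative
         offset (j - i) + (n - 1); an assumed 1-D simplification.
    """
    n = len(k)
    d_k = len(k[0])
    outputs = []
    for i, q_i in enumerate(q):
        logits = []
        for j, k_j in enumerate(k):
            score = dot(q_i, k_j)                        # content term q k^T
            if rel is not None:
                score += dot(q_i, rel[j - i + n - 1])    # position term q r^T
            logits.append(score / math.sqrt(d_k))
        weights = softmax(logits)                        # weights for v
        outputs.append(
            [dot(weights, [v_j[c] for v_j in v]) for c in range(len(v[0]))]
        )
    return outputs


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
rel = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.0]]  # offsets -1, 0, +1
print(attention_head(q, k, v, rel))
```

Each output row is a convex combination of the rows of `v`, so a query that scores all keys equally simply returns their average.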
As shown in Figure 6, RT-YOLO replaces the IDetect head in the original structure with the Bottleneck Transformer Detect Head (BTDetect). BTDetect has the capability to globally localize small-scale and occluded objects, enhancing the predictive potential of the neural network model. As shown in Figure 3, RT-YOLO has four prediction heads at different scales. To limit algorithmic complexity and reduce memory cost, BTDetect is applied only after Feature Maps 3 and 4, which have the largest scale and contain the most detailed information. This selective approach preserves real-time performance for RT-YOLO while achieving precise localization of small-scale and occluded objects.

Multi-Scale Fusion Strategy
RT-YOLO combines BOTrans and employs a multi-scale early fusion strategy during training. This strategy starts with feature fusion and then sends the fused feature maps to the prediction head. As the network deepens, the feature map scale changes together with the receptive field, which is the size of the area on the original image that a pixel of the feature map is mapped from. High-level feature maps with low resolution, rich semantic information, and a global receptive field are advantageous for detecting medium and large objects. Conversely, low-level feature maps with high resolution, detailed information, and a small receptive field are better suited for detecting small-scale objects. Therefore, fusing feature maps at different scales and combining high-level semantic information with low-level detail information improves the representation ability of the network. This reduces the instances of missing or falsely detecting small-scale objects.
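How the receptive field grows with depth can be illustrated with the standard recurrence for stacked convolutions: each layer adds (kernel − 1) times the current pixel spacing to the receptive field, and the spacing is multiplied by the layer's stride. The layer stack below is a hypothetical example chosen for the arithmetic, not the actual RT-YOLO or YOLOv7 configuration, and `receptive_field` is an illustrative name.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) after a stack of conv layers.

    layers: iterable of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # a single input pixel sees itself; spacing is 1 pixel
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap widens the view by `jump`
        jump *= stride             # striding spreads output pixels further apart
    return rf, jump


# Hypothetical stacks of 3x3 convolutions with stride 2.
shallow = [(3, 2)] * 3
deep = [(3, 2)] * 5
print(receptive_field(shallow))  # (15, 8)
print(receptive_field(deep))     # (63, 32)
```

In this toy stack, a pixel at cumulative stride 8 sees a 15 × 15 input region while a pixel at stride 32 sees 63 × 63, which matches the qualitative statement that deeper, lower-resolution maps cover larger objects.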
The multi-scale fusion training in the RT-YOLO algorithm involves three steps. First, the Backbone extracts image features to create four feature maps at different scales. Next, these four feature maps are fused with the main path of the Neck and up-sampled to produce new feature maps. In the third step, E-ELAN and BOTrans enhance the new feature maps for feature extraction, and then they are down-sampled to achieve feature fusion, resulting in four feature maps of various scales. In YOLOv7, three feature maps are sent to the detection head for prediction, and one of them is dedicated to detecting small objects. In RT-YOLO, there are four feature maps in total, with two of them specifically used for detecting small objects.
As shown in Figures 3 and 7, the two low-level feature maps from layers 4 and 6 in the Backbone have high resolution and rich detailed information, making them suitable for small-scale object detection. The channel numbers of the two feature maps are first adjusted using a 1 × 1 convolution with a stride of 1. The output features from layer 4 are then combined with the primary features in Neck layer 19 after the latter are up-sampled. Subsequently, the output features from layer 4 are fused with the output features from layer 19, and after enhanced feature extraction by BOTrans, Feature Map 4 is obtained. The output features from layer 6 are fused with the output features from layer 18 and then combined with Feature Map 4 to yield Feature Map 3. These feature maps combine detailed information from low-level layers with rich semantic information from deep layers. They are designed specifically for detecting small-scale and occluded objects. Feature Map 2 and Feature Map 1 are obtained in the same way as in YOLOv7. To observe which objects each scale of feature map attends to, the localization information of all objects is presented through gradient heat maps.
The GradCam [25] tool with a confidence threshold (conf) = 0.5 is used to visualize the gradient heat map, as shown in Figure 8. The objects in Figure 8a include heads and human bodies, which are small-scale and partially obscured. Feature Map 1 in Figure 8c lacks detailed information and primarily extracts object information at larger scales in the near field. Feature Map 2 in Figure 8d extracts the global large-scale object information based on Figure 8c, but struggles to distinguish edge features. Feature Map 3 in Figure 8e becomes more sensitive to small-scale human features after the initial fusion with low-level features. However, it may not accurately locate smaller-scale head features. Feature Map 4 in Figure 8f is rich in detail and semantic information, making it sensitive to small-scale objects and able to accurately locate head features in the image. In summary, RT-YOLO undergoes multi-scale fusion training, enhancing its ability to detect small-scale and occluded objects while retaining its capability to detect large-scale objects.

Normalization-Based Attention Module
The attention mechanism allows the neural network to concentrate on essential image areas, suppressing interference from irrelevant regions. Its lightweight, plug-and-play nature enhances performance at low cost. Attention mechanisms are generally categorized into spatial, channel, and hybrid domains, and attention mechanisms that excel in neural networks, such as CBAM [26] and GAM [27], are designed from these perspectives.
In RT-YOLO, Normalization-based Attention (NAM) considers the contribution of weight factors to attention. It employs the scaling factor of Batch Normalization to calculate these weights, eliminating the need for repetitive stacking of fully connected and convolutional layers. NAM strengthens the focus of RT-YOLO on specific features or channels, automatically adjusting weights based on input data characteristics and network requirements, reducing interference, and enabling the model to concentrate on meaningful features. NAM adopts the modular integration approach of CBAM and introduces weight contribution factors to reconfigure the spatial and channel attention modules. This enables NAM to be integrated directly after a network layer or residual structure. The Batch Normalization (BN) weight contribution factor is calculated as follows:

$$B_{out} = \mathrm{BN}(B_{in}) = \gamma \frac{B_{in} - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}} + \beta \tag{3}$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of feature map $B$, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the trainable scale and shift parameters. The scaling factor $\gamma$ reflects the variance of each channel in BN. During model training, a larger variance indicates that the channel contains richer feature information and is therefore a more important region to attend to.
In RT-YOLO, NAM is added before feature fusion to improve the efficiency of feature extraction. Specifically, channel attention and then spatial attention are computed for the input features, where X denotes the input and y the output. The structure of the channel attention sub-module $M_c$, redesigned with the weight contribution factor, is shown in Figure 9 and Equation (5).

$$M_c = \mathrm{sigmoid}\left(W_\gamma\left(BN(F_1)\right)\right) \quad (5)$$

Here, $\gamma$ is the scaling factor of each channel, and the weight is $W_\gamma = \gamma_i / \sum_j \gamma_j$. After the feature map passes through the channel attention sub-module, the result is fed to the spatial attention sub-module $M_s$, which applies the weight contribution factor to the spatial dimension in a process called pixel normalization. Its structure is shown in Figure 10 and Equation (6).

$$M_s = \mathrm{sigmoid}\left(W_\lambda\left(BN_s(F_2)\right)\right) \quad (6)$$

Here, $\lambda$ is the scaling factor of the space, and the weight is $W_\lambda = \lambda_i / \sum_j \lambda_j$. To eliminate the effects of irrelevant weights, NAM adds a regularization term to its loss function, as expressed in Equation (7).

$$Loss = \sum_{(x,y)} l\left(f(x, W), y\right) + p \sum g(\gamma) + p \sum g(\lambda) \quad (7)$$

Here, the first sum runs over the input-output pairs (x, y), W is the network weight, l(·) is the loss function, g(·) is the L1-norm penalty function, and p is the penalty coefficient that balances g(γ) and g(λ).
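The weighting step of the channel attention sub-module can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in RT-YOLO: the feature values are assumed to be batch-normalized already, absolute values of γ are taken so that the weights stay non-negative, the sigmoid gate is multiplied back onto the feature value, and all function names are hypothetical.

```python
import math


def channel_weights(gammas):
    """Normalize BN scaling factors into weights W_gamma = |g_i| / sum_j |g_j|."""
    total = sum(abs(g) for g in gammas)
    if total == 0:
        # Assumption: fall back to uniform weights when every gamma is zero.
        return [1.0 / len(gammas)] * len(gammas)
    return [abs(g) / total for g in gammas]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(bn_features, gammas):
    """Gate each channel of an already batch-normalized feature map.

    bn_features: list of channels, each a flat list of normalized values.
    """
    weights = channel_weights(gammas)
    attended = []
    for channel, weight in zip(bn_features, weights):
        # gate = sigmoid(W_gamma * BN(F)); the gate then rescales the feature
        attended.append([v * sigmoid(weight * v) for v in channel])
    return attended


gammas = [0.9, 0.1]                      # hypothetical BN scaling factors
features = [[1.0, -1.0], [1.0, -1.0]]    # two channels with identical values
weights = channel_weights(gammas)
attended = channel_attention(features, gammas)
```

With identical inputs in both channels, the channel with the larger scaling factor receives the larger gate for positive activations, which is the behavior the weight contribution factor is meant to produce.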

Dataset and Experimental Environment
In this paper, the experimental platform consists of an Intel(R) Xeon(R) E5-1650 v3 CPU and an NVIDIA GeForce RTX 2080Ti GPU, running PyTorch 1.12.1 and CUDA 11.6. The training data is the public crowded-pedestrian dataset CrowdHuman [28], and the WiderPerson [29] dataset is used to verify the robustness of the algorithm.
The CrowdHuman dataset comprises a training set of 15,000 images, a test set of 5000 images, and a validation set of 4370 images. Because the scenes vary, the image dimensions range from 1000 × 600 to 2000 × 1500 pixels. The training and validation sets together contain around 470,000 instances, with approximately 23 people per image. The dataset provides three types of annotations (head, full body, and visible body) and features various occlusions. As shown in Figure 11, the "full body" annotation interferes with the precise localization of the targets, so it has been removed.
The WiderPerson dataset consists of 13,382 images with approximately 400,000 annotations covering various occlusion scenarios. Image dimensions vary between 1000 × 600 and 2000 × 1500 pixels. As no official test set is provided, we randomly divided the dataset into 8000 training images, 4382 testing images, and 1000 validation images. WiderPerson has five annotation types: pedestrians, riders, partially visible persons, ignore regions, and crowd. Riders and partially visible persons are less frequent, so they are grouped together with pedestrians for a more comprehensive analysis. As shown in Figure 12, the ignore regions and crowd labels are removed from the dataset because they do not match the experimental requirements.
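The random WiderPerson split described above can be sketched as follows. This is only an illustration: the image identifiers are placeholders, and the fixed seed and the function name `split_dataset` are assumptions introduced here for reproducibility, since the text does not state how the random division was drawn.

```python
import random


def split_dataset(image_ids, n_train=8000, n_test=4382, n_val=1000, seed=0):
    """Shuffle image IDs and cut them into disjoint train/test/val subsets."""
    if n_train + n_test + n_val != len(image_ids):
        raise ValueError("split sizes must sum to the number of images")
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle; the seed is an assumption
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]
    return train, test, val


# Placeholder IDs standing in for the 13,382 WiderPerson images.
image_ids = [f"{i:06d}" for i in range(13382)]
train, test, val = split_dataset(image_ids)
```

Keeping the three subsets disjoint matters here because the test images are later used to report mAP, so any overlap with the training images would inflate the results.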


Evaluation Metrics and Experimental Details
In this paper, the experimental evaluation metrics include FPS (frames per second, representing the detection speed of the algorithm), Precision, Recall, and mAP (mean average precision). The F1 score is the harmonic mean of precision and recall and is used to assess a model's performance in classifying both positive and negative samples. GFLOPS (giga floating-point operations, counted for one forward pass) measures the computational complexity of a neural network model; it is typically considered together with hardware computational performance to evaluate computational requirements and speed. IOU (Intersection over Union) represents the overlap between the ground truth box and the prediction box. The formulas are as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i$$

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$

where TP represents the positive samples that are correctly detected, FP represents the samples that are incorrectly predicted as positive (false detections), and FN represents the positive samples that are incorrectly predicted as negative (missed detections). k represents the number of categories, $AP_i$ represents the average precision of category i, and A and B denote the ground truth box and the prediction box.
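The metrics above can be made concrete with a small worked example. The sketch below uses the standard definitions of precision, recall, F1, mAP, and IOU; the counts and box coordinates are invented for illustration, and boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def mean_ap(ap_per_class):
    """mAP as the plain average of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Invented counts: 80 correct detections, 20 false detections, 40 misses.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
f1 = f1_score(p, r)
# Two 10 x 10 boxes shifted by 5 pixels: intersection 50, union 150.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

In this example the prediction overlaps the ground truth by one third, so it would not count as a true positive at the IOU threshold of 0.5.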
In all experiments in this section, the initial learning rate is lr = 0.01, lr is adjusted by cosine annealing during training, and the model weights are optimized with SGD. mAP is reported as mAP50 (mean average precision at an IOU threshold of 0.5) and mAP50:95 (mean average precision averaged over IOU thresholds from 0.5 to 0.95). mAP50 uses the common IOU threshold of 0.5 to decide whether a detection is valid, while mAP50:95 provides a more comprehensive performance assessment because it considers multiple IOU thresholds.
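For readers unfamiliar with cosine annealing, the schedule can be sketched in a few lines. The initial learning rate of 0.01 comes from the text and the 200 epochs from the comparison experiment; the final learning rate `lr_min` and the function name are assumptions made for this illustration, since the text does not report the lower bound of the schedule.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=200, lr_init=0.01, lr_min=1e-4):
    """Learning rate at a given epoch under a single cosine decay cycle.

    lr_min is a hypothetical lower bound, not a value reported in the paper.
    """
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cos_factor


# Learning rate for every epoch boundary from 0 to 200.
schedule = [cosine_annealing_lr(epoch) for epoch in range(201)]
```

The schedule starts at the initial rate, decays slowly at first, fastest around the midpoint, and flattens again near the end, which avoids the abrupt drops of step-based schedules.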

Comparison Experiment
To verify the detection and real-time performance of the RT-YOLO algorithm, we compare it with classical and state-of-the-art algorithms on CrowdHuman in terms of FPS, mAP, GFLOPS, and number of parameters. Since the CrowdHuman training set is large, none of the algorithms uses pre-trained weights. Each algorithm is trained for 200 epochs with an input size of 640 × 640, and the best training weights are used to compute the results on the CrowdHuman test set. The results are shown in Table 2 and Figure 13.
In Table 2, RT-YOLO includes extra computation modules, so its speed is slightly lower than that of YOLOv7 and similar to that of the state-of-the-art YOLOv8-l. However, it still reaches 67 FPS, ensuring real-time detection, with a 3.8% improvement in mAP50 and a 4.9% improvement in mAP50:95. In terms of model complexity, RT-YOLO has 46.1 fewer GFLOPS and 4.1 million fewer parameters than the latest YOLOv8-l. Compared with YOLOv7, both the parameter count and the complexity increase, but accuracy improves significantly. Figure 13 shows the PR curves of Precision against Recall. RT-YOLO attains the highest Recall value, indicating the lowest rate of missed pedestrian detections among the compared algorithms. In addition, RT-YOLO is deployed on the RTX 2080Ti with TensorRT FP32 acceleration. With a slight reduction in mAP, the FPS increases to 84, which preserves its practical value.
YOLOv7, YOLOv8, and RT-YOLO, which perform well in the experiments, are applied to real-life scenes to compare their detection effects. The detection results are shown in Figure 14. They show that RT-YOLO excels in practical applications, particularly in multi-scale feature fusion, and improves the detection of small-scale and occluded objects. In Figure 14I, RT-YOLO successfully detects distant pedestrians that are missed by YOLOv7 and YOLOv8-l. In Figure 14II, RT-YOLO successfully detects crowded pedestrians in the upper part of the image that are missed by YOLOv7 and YOLOv8-l. In Figure 14III, where the objects in the image are blurred and difficult to identify, RT-YOLO successfully detects a greater number of distant and near-field pedestrians. In summary, RT-YOLO effectively distinguishes occluded pedestrian boundaries, preserves features of small-scale pedestrians through multi-scale fusion, significantly enhances the detection accuracy of crowded pedestrians, and maintains real-time performance.
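The throughput figures quoted above are easier to compare when converted to time per frame. The short calculation below is only arithmetic on the two FPS values reported in the text (67 FPS for the trained model and 84 FPS after TensorRT FP32 acceleration); the function name is hypothetical.

```python
def frame_latency_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


baseline_ms = frame_latency_ms(67)   # RT-YOLO before acceleration
tensorrt_ms = frame_latency_ms(84)   # RT-YOLO with TensorRT FP32
speedup = 84 / 67                    # relative throughput gain
```

In other words, the accelerated deployment saves roughly 3 ms per frame, a throughput gain of about 25%.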


Ablation Experiment
To verify the contribution of each improved component, different combinations of the components are trained in ablation experiments and then tested on the CrowdHuman and WiderPerson datasets. In the tables, higher GFLOPS indicates greater model complexity, FPS denotes the detection speed of the algorithm, Params represents the model size, and Layers refers to network depth. The F1 score ranges from 0 to 1 and is used to measure classification ability.
According to Tables 3-5, RT-YOLO increases model complexity by 12.6 GFLOPS and adds 2 million parameters compared with the original algorithm, which reduces speed by 24 FPS. In CrowdHuman, the overall mAP50 improves by 3.8% and mAP50:95 improves by 5.1%. In WiderPerson, the overall mAP50 improves by 3.4% and mAP50:95 improves by 4%. The F1 score also improves on both datasets. In summary, RT-YOLO increases model complexity but significantly improves detection and classification performance.
In the CrowdHuman dataset, most person labels have dimensions ranging from approximately 50 × 100 to 150 × 400, while head labels have dimensions between 20 × 20 and 150 × 200. These represent two distinct object scales: person labels pose challenges related to occlusion, and head labels pose challenges related to smaller-scale objects. According to the results presented in Table 6, RT-YOLO exhibits significant improvements in detecting both label categories. In the head category, the overall mAP50 improves by 3.5% and mAP50:95 improves by 4.5%. In the person category, the overall mAP50 improves by 4.1% and mAP50:95 improves by 5.3%.
Considering the above tables, we analyze the effects of the different components on the performance of RT-YOLO from three aspects:

• Effect of multi-scale fusion strategy with BOTrans: Adding a BOTrans object detection layer increases the number of network layers from 415 to 509 and GFLOPS from 106.5 to 119.5. FPS decreases from 81 to 67 but still meets real-time requirements. In CrowdHuman, the overall mAP50 improves by 2.7%, and mAP50:95, which better reflects overall performance, improves by 3.5%. In WiderPerson, the improvements are 2.2% and 2.5%, respectively. In the BOTrans detection layer, self-attention efficiently distinguishes occluded pedestrian boundaries, reducing missed and false detections. During model training, shallow-level details are combined with deep semantic information, which alleviates the loss of effective features caused by increasing network depth. The resulting feature maps contain both high-level semantic information and shallow-level details, improving the detection of small-scale and occluded pedestrians. The additional detection layer increases the overall depth and complexity of the network, but it brings a significant mAP improvement and provides the basis for BTDetect, so the cost is acceptable.

• Effect of BTDetect prediction head: The two large-scale feature maps output by Neck are passed into BTDetect. The Bottleneck residual structure of BTDetect alleviates vanishing gradients and the loss of feature information. In addition, the Transformer efficiently captures detailed information from the larger-scale feature maps, focusing on the features of small-scale and obscured objects, which effectively enhances detection performance. Finally, BTDetect divides the feature map into grids and decouples the prediction within each cell to predict object positions and categories. The overall mAP50 improvement is 0.8% in CrowdHuman and 0.7% in WiderPerson. The mAP50:95 improvement is 1.2% in CrowdHuman and 1.3% in WiderPerson.

• Effect of NAM attention: NAM attention is a lightweight, plug-and-play module that does not increase complexity. In RT-YOLO, NAM computes weighted attention over the channel and spatial dimensions, which reduces the influence of irrelevant regions in the feature map. The overall mAP50 improvement is 0.3% in CrowdHuman and 0.5% in WiderPerson. The mAP50:95 improvement is 0.4% in CrowdHuman and 0.2% in WiderPerson.
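As a consistency check on the three items above, the per-component gains can be summed and compared with the overall improvements reported for RT-YOLO. The sketch below only restates numbers from the text; it assumes the components are stacked cumulatively (BOTrans fusion, then BTDetect, then NAM), which is an interpretation suggested by the matching totals rather than something stated explicitly.

```python
# Per-component gains in percentage points: [BOTrans fusion, BTDetect, NAM].
gains = {
    "CrowdHuman": {"mAP50": [2.7, 0.8, 0.3], "mAP50:95": [3.5, 1.2, 0.4]},
    "WiderPerson": {"mAP50": [2.2, 0.7, 0.5], "mAP50:95": [2.5, 1.3, 0.2]},
}

# Overall improvements over YOLOv7 reported for the full RT-YOLO model.
reported_total = {
    "CrowdHuman": {"mAP50": 3.8, "mAP50:95": 5.1},
    "WiderPerson": {"mAP50": 3.4, "mAP50:95": 4.0},
}

# Sum the increments, rounding to one decimal to absorb floating-point noise.
totals = {
    dataset: {metric: round(sum(values), 1) for metric, values in metrics.items()}
    for dataset, metrics in gains.items()
}
```

The sums reproduce the overall figures on both datasets, with the BOTrans detection layer accounting for roughly two thirds of each gain.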
As shown in Figure 15, YOLOv7 misses detections to some degree in various complex scenes, whereas RT-YOLO is able to fully extract object feature information, achieving accurate localization and classification of small-scale, obscured objects.
In summary, RT-YOLO increases the depth of the network within an acceptable range of computational and complexity increments. This enriches the high-level features with semantic information, while multi-scale fusion training effectively fuses rich details from low-level features into the feature maps, compensating for the loss of detailed features caused by the deep network layers and thereby improving the algorithm's feature extraction capability.


Validity Verification
To verify the effectiveness and robustness of the RT-YOLO algorithm, generalization experiments are conducted on Pascal VOC2007, which contains 20 categories and can be used to assess the performance of each algorithm. RT-YOLO is compared with other algorithms, and the experimental results are presented in Figure 16 and Table 7. According to these results, RT-YOLO performs excellently on Pascal VOC2007, surpassing classical object detection algorithms in all categories, and achieves a 5.4% improvement in mAP50 compared to YOLOv7.
In RT-YOLO, BOTrans performs multi-head self-attention computations, handling different features and channels in parallel. This enables the identification of features for small-scale and occluded objects, as well as the extraction of features for large-scale objects. RT-YOLO integrates scale-diverse features across different network layers and directs the network's focus to effective features with NAM. These components notably enhance the robustness and generalization performance of RT-YOLO.
The generalization and robustness of RT-YOLO give it practical application value in currently popular fields. In video surveillance [33], detecting objects at various scales and under occlusion is a significant challenge, and all subsequent tasks, such as region or behavior analysis, rely on successful object detection. In autonomous driving [34], detectors must be efficient and fast, which demands high-performance, lightweight algorithms. RT-YOLO's performance serves as a reference for further research and for lightweight improvements.

Conclusions
This paper introduces the RT-YOLO algorithm, aimed at enhancing the detection accuracy of small-scale and obscured objects in crowded pedestrian scenarios. The BOTrans module combines a convolutional network with the Transformer structure, replacing the E-ELAN structure in the original network. BOTrans extracts features globally and integrates contextual information to distinguish crowded and occluded objects. Building on this, the multi-scale fusion strategy is redesigned with dedicated network layers for small-scale and occluded objects, allowing them to obtain feature maps with receptive fields of different scales. This design ensures that the feature maps carry semantic information from high-level features while retaining detailed information from low-level features. The NAM attention mechanism introduced in RT-YOLO allows the network to focus efficiently on the object region of interest, thereby improving the utilization of network parameters. Finally, BTDetect decouples the largest-scale feature map to obtain classification and location information for objects, resulting in precise object localization.
On the filtered CrowdHuman and WiderPerson datasets, experimental results demonstrate that RT-YOLO improves mAP50 by 3.8% and 3.4% and mAP50:95 by 5.1% and 4%, respectively, compared with YOLOv7. Generalization experiments on Pascal VOC2007 validate its robustness and effectiveness, showing a 5.1% improvement in mAP over YOLOv7. Although the extra modules slightly reduce speed and increase complexity, RT-YOLO still meets real-time requirements. The experimental results also show that RT-YOLO has room for improvement in terms of lightweighting and loss functions. For example, using dilated convolutions instead of regular convolutions can reduce model complexity, although it may lead to a performance drop. The loss function could also be adjusted, either by changing the allocation strategy with Focal Loss or by pairing the GIOU and CIOU loss functions according to the object type. Subsequent work will continue to pursue better algorithmic performance.

Figure 2. The ELAN and E-ELAN structures.
Figure 3. The complete structure of RT-YOLO.
Neck enhances feature extraction by merging shallow and deep features, with the aim of combining information from different scales.The feature maps from Backbone are


Figure 4. The structure of Bottleneck layer.

Figure 6. The structure of Detect.
Appl.Sci.2023, 13, x FOR PEER REVIEW 9 of 21 occluded objects.The acquisition method for Feature Map 2 and Feature Map 1 remains consistent with YOLOv7.In order to observe objects information of interest for each scale of feature map, all objects localization information is presented through gradient heat maps.

Figure 8. Gradient heat map of Feature map.

Figure 13. The comparison of algorithm parameters.

Figure 14. RT-YOLO and classical algorithm detection effect comparison.
Figure 15. (a) YOLOv7 results; (b) RT-YOLO results.

Figure 16. Various categories of mAP.

Table 1. The parameters of ResNet and BOTrans.

Table 6. CrowdHuman experimental results for different scale objects.