Multi-Scale Target Detection in Autonomous Driving Scenarios Based on YOLOv5-AFAM

Abstract: Multi-scale object detection is critically important in the complex driving environments encountered in autonomous driving. To enhance the detection accuracy of both small-scale and large-scale targets in such environments, this paper proposes an improved YOLOv5-AFAM algorithm. Firstly, the Adaptive Fusion Attention Module (AFAM) and Down-sampling Module (DownC) are introduced to increase the detection precision of small targets. Secondly, the Efficient Multi-scale Attention Module (EMA) is incorporated, enabling the model to simultaneously recognize small-scale and large-scale targets. Finally, a Minimum Point Distance IoU-based loss function (MPDIoU loss) is introduced to improve the accuracy and efficiency of object detection. Experimental validation on the KITTI dataset shows that, compared to the baseline model, the improved algorithm increased precision by 2.4%, recall by 2.6%, mAP50 by 1.5%, and mAP50-90 by 4.8%.


Introduction
Over the past few decades, autonomous driving technology has emerged as a focal point in the research of intelligent transportation systems and robotics. Within this domain, target detection plays a crucial role, serving as one of the key elements to ensure vehicular safety. In the realm of autonomous driving research, the recognition of multi-scale targets is not only highly challenging but also a critical task for ensuring the precision and reliability of the system.
The development of target detection technology can be generally divided into two key phases: traditional methods and deep learning-based methods. In classical target detection, the Viola-Jones (VJ) detector, introduced by P. Viola and M. Jones, marked a major breakthrough in the field [1,2]. Furthermore, the Deformable Part Models (DPMs) proposed by P. Felzenszwalb, and later improved by R. Girshick, once dominated the VOC 2007, 2008, and 2009 detection challenges [3-5]. Due to the swift advancement in deep learning, especially with the widespread implementation of Convolutional Neural Networks (CNNs), the target detection field has experienced a significant transformation. In this revolution, R. Girshick and others introduced the Region-based Convolutional Neural Network (R-CNN) in 2016, incorporating CNN features and bringing a fresh perspective to the domain of target detection [6]. Subsequent advancements, such as Fast R-CNN [7] and Faster R-CNN [8], improved detection precision and significantly increased processing speed while reducing computational costs by integrating the Region Proposal Network (RPN).
In the field of target detection technology, the introduction of one-stage detectors such as YOLO [9] and SSD [10] represented a significant technological breakthrough. The advent of RetinaNet further propelled the development in this domain, as it effectively addressed the issue of class imbalance through the adoption of Focal Loss, significantly enhancing the model's efficiency and accuracy in processing various types of samples [11]. Moreover, RetinaNet conducts object detection at multiple feature levels, further improving detection performance.
The evolution of the YOLO series has been significant in target detection technology. Beginning with YOLOv2 [12] and YOLOv3 [13], the introduction of anchor boxes and multi-scale detection greatly improved the model's accuracy for small objects and complex backgrounds. YOLOv4 [14] and YOLOv5 [15] introduced further innovations, utilizing efficient feature extractors, attention modules, and data augmentation strategies to improve the model's precision and robustness. Subsequently, variants like YOLO-R [16], YOLOX [17], YOLOv6 [18], and YOLOv7 [19] continued to refine the network architecture and algorithms. Particularly in the domain of autonomous driving, these models ensured safe and efficient driving by quickly and accurately identifying and locating vehicles, pedestrians, and traffic signs. The latest, YOLOv8 [20], has further increased detection speed and accuracy, demonstrating the maturity and optimization potential of deep learning in practical applications.
Recent progress in object detection has resulted in the creation of more efficient and precise algorithms, particularly within the realms of autonomous driving and traffic monitoring. Ning and Wang (2022) developed an improved YOLOv5 network tailored for automatic driving scene target detection, achieving notable accuracy in complex driving scenarios [21]. Li et al. (2022) investigated the application of standard vision transformer backbones for object detection, contributing to advancements in transformer-based models [22]. Avşar and Avşar (2022) utilized deep learning techniques for detecting and tracking moving vehicles at roundabouts, demonstrating the efficacy of trajectory union methods [23]. Jeon and Jeon (2022) addressed computational challenges by quantizing the YOLOv5x6 model using TensorRT, achieving faster processing speeds while maintaining detection accuracy comparable to non-quantized models [24]. Hamzenejadi and Mohseni (2023) optimized YOLOv5 for real-time vehicle detection in UAV imagery by incorporating architectural enhancements that markedly improved performance [25]. Zheng et al. (2023) introduced YOLOv5s FMG, a refined algorithm aimed at detecting small targets in low visibility conditions, which enhances accuracy in challenging environments [26]. Despite these advancements, there remains a gap in multi-scale object detection research, highlighting the need for algorithms capable of effectively detecting and tracking objects of various sizes across diverse environments.
To address the issue of multi-scale target detection, this paper proposes an improved YOLOv5-AFAM algorithm (the link to the code for this article can be found in reference [27]). The main steps are as follows:
(1) The AFAM and the DownC downsampling module are introduced to extract semantic information, thereby enhancing the detection accuracy of small objects.
(2) The EMA module is incorporated to enhance the model's multi-scale detection capabilities.
(3) The MPDIoU loss function is utilized to enhance the precision and effectiveness of object detection.

YOLOv5 and Enhanced Model Architecture
The YOLOv5 architecture includes five distinct network scales: YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, and YOLOv5n. These variants are all built on the same framework but differ in the depth and width of CSPNet, leading to variations in model size and parameter count. The architecture is divided into three primary components: the backbone, the neck, and the head. The backbone extracts image features through a series of convolutional and pooling layers. The neck employs a feature pyramid structure [28] for top-down feature extraction, enhancing the detection capabilities for targets of various scales. The head consists of three detection layers that handle specific tasks such as classification, detection, or segmentation, and utilizes a grid-based anchoring strategy for object detection, particularly improving the detection of small objects. Through the collaborative functioning of these three parts, YOLOv5 achieves efficient and accurate detection of multi-scale targets while maintaining speed and flexibility, making it suitable for a wide range of application scenarios. The network structure of YOLOv5 is illustrated in Figure 1.
This paper proposes a novel YOLOv5 model architecture. Firstly, the AFAM is introduced into the C3 modules of the backbone network, creating new C3AFAM modules. Subsequently, the convolution modules in the backbone network are replaced with DownC downsampling modules. In the neck network structure, an additional detection layer is added specifically for small object detection, and EMA modules are incorporated from the second to fifth C3 modules. Finally, features downsampled by factors of 4×, 8×, and 16× are combined and input into the detection head for prediction. The proposed YOLOv5 model architecture is illustrated in Figure 2.

Improvements to the Backbone Section
To implement multi-scale target detection precisely, this study introduces the AFAM [29]. AFAM comprises two sub-modules: the Adaptive Channel Module (ACM) and the Fusion Spatial Module (FSM). When integrated into the backbone network of YOLOv5, the AFAM is embedded within the C3 module. As illustrated in Figure 3, the process begins with the input feature map $F \in \mathbb{R}^{C \times H \times W}$. This feature map is passed through the ACM and then the FSM, which sequentially integrate attention information in the channel and spatial dimensions. The mathematical expression is shown in Equation (1), where $F_A$ denotes the features obtained by passing $F$ through the ACM, and $F'$ denotes the features obtained by passing $F_A$ through the FSM. This structural design is intended to enhance the model's ability to handle multi-scale features and to improve the accuracy of detecting small objects, thereby boosting the overall performance of object detection.
In this context, $f_A(\cdot)$ represents the ACM, $f_F(\cdot)$ represents the FSM, and $\otimes$ denotes an element-wise multiplication operation. The ACM, illustrated in Figure 4, begins with parallel convolutions using kernels of three distinct sizes (1 × 1, 3 × 3, and 5 × 5) to capture multi-scale feature information. These convolutional outputs are subsequently integrated via a weighted summation to generate the fused features (denoted as $F_f$). The mathematical details of this process are provided in Equation (2). Following this, the fused features undergo average pooling (AvgPool) and max pooling (MaxPool) operations to produce descriptors that characterize the targets. Finally, these descriptors are further analyzed by a multi-layer perceptron (MLP) to yield the final adaptive receptive field and channel attention features. This process is depicted in Equation (3).
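Equations (1) to (3) can be written compactly from the definitions given in the text. The block below is a reconstruction under assumptions rather than the authors' exact notation: the kernel-size subscript $k_i$ is introduced here for illustration, and the shared-MLP form of Equation (3) (as in CBAM-style channel attention) is assumed, since the text only states that the pooled descriptors are passed through an MLP and that $\sigma$ is the sigmoid function.

```latex
\begin{align}
F_A &= f_A(F) \otimes F, \qquad F' = f_F(F_A) \otimes F_A \tag{1}\\
F_f &= \sum_{i=1}^{3} \omega_i \,\mathrm{Conv}_{k_i}(F), \qquad k_i \in \{1, 3, 5\} \tag{2}\\
f_A(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F_f)) + \mathrm{MLP}(\mathrm{MaxPool}(F_f))\big) \tag{3}
\end{align}
```

Read this way, the ACM produces one attention weight per channel, which rescales $F$, and the FSM then produces one weight per spatial position, which rescales $F_A$.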
In Equation (2), Conv(·) signifies the convolution operation using three different kernel sizes, and $\omega_i$ represents the $i$th weight factor for each convolution operation. In Equation (3), $\sigma(\cdot)$ denotes the sigmoid function. AvgPool and MaxPool refer to the average pooling and max pooling operations, respectively. The FSM, as illustrated in Figure 5, inputs the feature maps generated by the ACM (denoted as $F_A$) into three distinct branches in parallel, aiming to extract multi-dimensional feature information, as shown in Equation (4).
In Equation (4), $f_{\max}(\cdot)$ represents the convolution operation with a kernel size of $h \times \omega$.
Subsequently, the output features from these three branches are merged and further refined through a convolutional module (denoted as $f_C(\cdot)$). Afterwards, the merged feature maps are normalized via a sigmoid activation function to obtain the FSM features $F'$. This process is represented in Equation (5).
In Equation (5), F max A , F
To tackle the challenges of detecting small objects and the difficulty in feature extraction, this paper introduces a DownC module. This module enlarges the receptive field and reduces the model's parameter count, which helps prevent overfitting during training. The MaxPool layer within this module decreases computational load while preserving texture features and retaining rich information.
As demonstrated in Figure 6, the DownC module comprises two branches for downsampling.One branch is the "All-Conv" branch, and the other is the "MaxPool-Conv" branch.To combine features from both branches while maintaining the same number of channels, each branch's channels are halved.The "All-Conv" branch consists of two convolutional modules, with the first module having a stride of 1 to halve the channels.The "MaxPool-Conv" branch begins with a max pooling operation, followed by a 1 × 1 convolutional module.In the "MaxPool-Conv" branch, the convolutional module also has a stride of 1 to reduce the number of channels.
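The two-branch downsampling described above can be checked with simple shape arithmetic. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the 3 × 3 kernel with padding 1 and the 2 × 2 pooling window with stride 2 are assumptions (the text only gives the strides of the 1 × 1 convolutions). It also assumes even channel counts and even spatial sizes.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a conv or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def downc_shapes(channels, height, width):
    """Trace (C, H, W) through both DownC branches and their concatenation."""
    half = channels // 2

    # "All-Conv" branch: 1x1 conv (stride 1) halves the channels,
    # then a 3x3 conv (stride 2, padding 1 assumed) halves H and W.
    all_conv = (half, conv_out(height, 1, 1), conv_out(width, 1, 1))
    all_conv = (half, conv_out(all_conv[1], 3, 2, 1), conv_out(all_conv[2], 3, 2, 1))

    # "MaxPool-Conv" branch: 2x2 max pooling (stride 2 assumed) halves H and W,
    # then a 1x1 conv (stride 1) halves the channels.
    pooled = (channels, conv_out(height, 2, 2), conv_out(width, 2, 2))
    max_conv = (half, conv_out(pooled[1], 1, 1), conv_out(pooled[2], 1, 1))

    if all_conv[1:] != max_conv[1:]:
        raise ValueError("branches disagree on spatial size; use even H and W")

    # Channel-wise concatenation restores the original channel count.
    fused = (all_conv[0] + max_conv[0], all_conv[1], all_conv[2])
    return all_conv, max_conv, fused


print(downc_shapes(64, 160, 160))
# ((32, 80, 80), (32, 80, 80), (64, 80, 80))
```

Under these assumptions each branch contributes half of the channels, so the concatenated output keeps the channel count while halving the height and width.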

Improvements to the Neck Section
As depicted in Figure 7, the EMA [30] method is applied to any input feature map $X \in \mathbb{R}^{C \times H \times W}$. EMA segments $X$ into $G$ groups of sub-features along the channel axis, facilitating the learning of distinct semantic characteristics. This grouping can be expressed as $X = [X_0, X_1, \ldots, X_{G-1}]$, where each sub-feature $X_i \in \mathbb{R}^{C/G \times H \times W}$. Typically, $G$ is much smaller than $C$. It is assumed that the learned attention weight descriptors will improve the feature representation capability of regions of interest within each sub-feature group.
EMA derives the attention weight descriptors from the clustered feature maps using three parallel routes, consisting of two 1 × 1 branches and one 3 × 3 branch.The 1 × 1 branches utilize global average pooling operations, while the 3 × 3 branch employs a 3 × 3 convolutional kernel to expand the expression of features.
Specifically, EMA introduces two tensors that correspond to the outputs of the 1 × 1 and 3 × 3 branches, respectively. It then performs 2D global average pooling on the output of the 1 × 1 branch to capture global spatial information. Before the channel features are jointly activated, the output of the 3 × 3 branch is directly reshaped into the corresponding dimensional format. The operation of 2D global pooling is given in Equation (6):
$$z_c = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} x_c(i, j) \qquad (6)$$
In Equation (6), $z_c$ represents the scalar value obtained after global average pooling of the feature map for a specific channel $c$. The term $\frac{1}{H \times W}$ acts as the normalization factor for the average pooling, where $H$ and $W$ are the height and width of the feature map, respectively. $\sum_{j}^{H} \sum_{i}^{W} x_c(i, j)$ denotes the summation of all pixel values $x_c(i, j)$ in the feature map for channel $c$. This strategy not only enhances the pixel-level attention of CNNs to high-level feature maps but also, through the use of parallel convolution kernels, improves the handling of both short-range and long-range dependencies, making the model both efficient and adaptable to modern architectures.
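The channel grouping and the global average pooling of Equation (6) can be illustrated with a few lines of plain Python. This is a minimal sketch on nested lists, not the EMA implementation: the function names and the toy feature map are hypothetical, and the attention branches, normalization, and activations are omitted.

```python
def split_into_groups(x, groups):
    """Split a C x H x W feature map (nested lists) into G sub-features.

    Each returned sub-feature holds C // G consecutive channels.
    """
    channels = len(x)
    if groups <= 0 or channels % groups != 0:
        raise ValueError("C must be divisible by G")
    step = channels // groups
    return [x[start:start + step] for start in range(0, channels, step)]


def global_avg_pool(x):
    """Return z_c = (1 / (H * W)) * sum of all x_c(i, j), one value per channel."""
    pooled = []
    for channel in x:
        height, width = len(channel), len(channel[0])
        total = sum(sum(row) for row in channel)
        pooled.append(total / (height * width))
    return pooled


# Hypothetical 4 x 2 x 2 feature map (C=4, H=2, W=2).
feature_map = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
    [[1, 0], [0, 1]],
]
groups = split_into_groups(feature_map, groups=2)
descriptors = [global_avg_pool(group) for group in groups]
print(descriptors)  # [[2.5, 0.0], [2.0, 0.5]]
```

Each group yields one scalar descriptor per channel, which is the quantity $z_c$ that the 1 × 1 branch feeds into the subsequent attention computation.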

MPDIoU Loss Function
The Minimum Point Distance Intersection over Union (MPDIoU) loss [31], proposed by Siliang Ma, Yong Xu, et al., addresses the failure of CIoU when the actual and predicted bounding box centers coincide and the aspect ratios match, as well as the problem of extensive overlap among multiple predicted boxes. MPDIoU encompasses all relevant factors considered in existing loss functions, such as overlap or non-overlap area, center point deviation, and deviations in width and height, while simplifying the computational process.
This approach directly minimizes the distance between the top-left and bottom-right corners of the predicted bounding box and the ground truth box. During the training phase, the formula forces the bounding box $B_{prd} = [x^{prd}, y^{prd}, w^{prd}, h^{prd}]$ to approach its actual bounding box $B_{gt} = [x^{gt}, y^{gt}, w^{gt}, h^{gt}]$, as expressed in Equation (7). In Equation (7), $B_{gt}$ represents the set of actual annotated boxes, while $\Theta$ denotes the parameters used for regression in deep learning. The loss function of MPDIoU is defined as shown in Equation (8).
Thus, all factors of the existing bounding box loss functions are determined by four points, with the transformation formula presented in Equation (9). The width and height of the actual bounding box are denoted as $w_{gt}$ and $h_{gt}$, while those of the predicted bounding box are denoted as $w_{prd}$ and $h_{prd}$.
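The corner-distance idea can be made concrete with a short sketch. It follows the published MPDIoU definition [31], MPDIoU = IoU − d1²/(w² + h²) − d2²/(w² + h²) with loss 1 − MPDIoU, where d1 and d2 are the distances between the top-left and the bottom-right corners of the two boxes and w and h are the width and height of the input image. The function names, the corner-format boxes, the assumption that (x, y) in the [x, y, w, h] format is the box center, and the small `eps` term are illustrative choices, not taken from the paper's code.

```python
def xywh_to_xyxy(box):
    """Convert [x_center, y_center, w, h] to (x1, y1, x2, y2).

    Assumes (x, y) is the box center.
    """
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def mpdiou_loss(box_prd, box_gt, img_w, img_h, eps=1e-9):
    """MPDIoU loss for two boxes in (x1, y1, x2, y2) corner format."""
    px1, py1, px2, py2 = box_prd
    gx1, gy1, gx2, gy2 = box_gt

    # Intersection and union areas for the plain IoU term.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Squared distances between matching corners.
    d1_sq = (px1 - gx1) ** 2 + (py1 - gy1) ** 2  # top-left
    d2_sq = (px2 - gx2) ** 2 + (py2 - gy2) ** 2  # bottom-right

    # Normalize by the squared diagonal of the input image.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou


# Identical boxes give a loss of (almost exactly) zero.
print(mpdiou_loss((0, 0, 10, 10), (0, 0, 10, 10), 100, 100))
# Disjoint boxes are still penalized by the corner distances.
print(mpdiou_loss((0, 0, 10, 10), (20, 20, 30, 30), 100, 100))
```

Because the corner-distance terms remain informative when the boxes do not overlap, the loss keeps a usable gradient signal in the case where the IoU term alone is zero.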

Dataset
The KITTI dataset, jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago, gathers data from various sensors, including cameras and LiDAR, capturing a variety of real-world driving scenarios.
In the KITTI dataset used for this experiment, the 2D image target categories were originally divided into "Truck", "Van", "Tram", "Car", "Person_sitting", "Cyclist", and "Pedestrian". In this study, these categories were simplified and reclassified. The three vehicle types "Truck", "Van", and "Tram" were consolidated into the "Car" category, while "Person_sitting" and "Cyclist" were merged into the "Pedestrian" category. This adjustment was made to facilitate a more effective experimental analysis.
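The reclassification above amounts to a lookup table applied to each KITTI label line. The sketch below is illustrative only: the function name, the class ids (0 for "Car", 1 for "Pedestrian"), and the choice to skip any category not listed in the text are assumptions, and the sample line uses made-up values in the standard KITTI label layout (type, truncation, occlusion, alpha, then the 2D box as left, top, right, bottom).

```python
# Mapping from the original KITTI categories to the two merged classes.
CLASS_MAP = {
    "Car": "Car",
    "Truck": "Car",
    "Van": "Car",
    "Tram": "Car",
    "Pedestrian": "Pedestrian",
    "Person_sitting": "Pedestrian",
    "Cyclist": "Pedestrian",
}

# Hypothetical class ids for the merged two-class problem.
CLASS_IDS = {"Car": 0, "Pedestrian": 1}


def remap_kitti_line(line):
    """Return (class_id, [left, top, right, bottom]) or None if skipped."""
    fields = line.split()
    if len(fields) < 8:
        return None
    merged = CLASS_MAP.get(fields[0])
    if merged is None:
        # Categories not named in the text are dropped (an assumption).
        return None
    bbox = [float(value) for value in fields[4:8]]
    return CLASS_IDS[merged], bbox


sample = "Van 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
print(remap_kitti_line(sample))  # (0, [587.01, 173.33, 614.12, 200.12])
```

Keeping the mapping in one dictionary makes it easy to audit which original categories end up in each merged class.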

Preparation Work
The configuration required for the experiment is shown in Table 1. Two RTX 4090 GPUs were utilized, and the training was conducted on an Ubuntu 20.04 system using the PyTorch 1.10.0+cu113 framework. The experimental parameter settings are shown in Table 2, with the number of training epochs set to 500, the number of categories set to 2, the batch size set to 8, and the image size set to 640 × 640.

Evaluation Metrics
The relevant evaluation metrics for this study include precision, recall, and the mean average precision (mAP) at different thresholds, specifically mAP@0.5 and mAP@0.5:0.95.
The formula for calculating precision is presented in Equation (10).
The calculation of recall is shown in Equation (11).
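Precision and recall are conventionally defined from the counts of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below assumes Equations (10) and (11) take this standard form; the function names and the example counts are hypothetical and are not results from this study.

```python
def precision(tp, fp):
    """Fraction of predicted detections that are correct: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0


def recall(tp, fn):
    """Fraction of ground-truth objects that are found: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```

The zero-denominator guard returns 0.0 when a class has no predictions or no ground-truth instances; this is a convention chosen for the sketch.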

Experimental Results and Analysis
In this study, a baseline model was constructed, incorporating the DownC downsampling module, AFAM, and specialized layers for small object detection.
The dataset in this study was divided into a training set and a validation set at a ratio of 8:2. The loss curves of the baseline model are shown in Figure 8.
Figure 9 illustrates the detection results of the baseline model. In Figure 9a, all cars and pedestrians in the images are successfully detected with confidence scores exceeding 90%. In Figure 9b, while some distant and smaller targets, such as vehicles and pedestrians, are present, most of them are successfully identified. Only a small portion of these targets are not detected, and the overall confidence scores remain high. In Figure 9c, the detection results for an image consisting entirely of cars are shown. The two cars furthest from the camera were not successfully detected, while the remaining cars were detected with high confidence. In Figure 9d, pedestrians are the dominant class. All pedestrians were successfully detected with confidence scores above 79%, indicating that there is still room for improvement in detection performance.
The baseline model data presented in Table 3 show performance variations across different categories. The overall precision is 91.3%, with the Car category achieving a precision of 92.1% and the Pedestrian category 90.5%. In terms of recall, the overall rate is 83.6%, with a higher 90.1% for the Car category and 77.2% for the Pedestrian category. For the mAP50 metric, the model scores 92.1% overall, with the Car category at 95.9% and the Pedestrian category at 88.3%. For the mAP50-90 metric, the overall performance is 64.7%, with the Car category at 74.2% and the Pedestrian category at 55.2%.
Table 3. Baseline model evaluation metrics. This table presents the precision, recall, and mean average precision (mAP) metrics for different classes using the baseline model. The results are shown for both mAP@50 and mAP@50-90 thresholds. The "All" category represents the overall performance across all classes, while specific metrics are provided for the "Car" and "Pedestrian" classes.
Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-90 (%)
All | 91.3 | 83.6 | 92.1 | 64.7
Car | 92.1 | 90.1 | 95.9 | 74.2
Pedestrian | 90.5 | 77.2 | 88.3 | 55.2
This study introduced the EMA module on top of the baseline model. The resulting loss function graph is shown in Figure 10.
After the introduction of the EMA module, the loss function graph becomes noticeably smoother and more stable, with minimal fluctuations, especially when compared to the baseline model. Particularly in terms of validation loss, the incorporation of the EMA module results in a smoother trend with almost no significant fluctuations. The improved model demonstrates enhanced generalization capabilities and overall better performance.
Table 4 demonstrates that the introduction of the EMA module led to significant improvements across all evaluation metrics. Specifically, the overall precision reached 93.9%, with the precision for the Car category at 94.3% and for the Pedestrian category at 93.4%. In terms of recall, the overall rate increased to 85.9%, with the Car category achieving a high recall rate of 91.5% and the Pedestrian category reaching 80.3%. For the mAP50 metric, the overall performance was 93.8%, with the Car category reaching 96.7% and the Pedestrian category at 90.8%. Regarding the mAP50-90 metric, the overall performance was 69.1%, with the Car category at 77.9% and the Pedestrian category at 60.4%.
The detection results after integrating the EMA module are shown in Figure 11. In Figure 11a, the confidence scores are significantly improved, with the highest reaching 97%, demonstrating a clear enhancement in performance. In Figure 11b, the improvement is more pronounced for the Car category compared to the Pedestrian category. Figure 11c shows better detection performance for close-range targets, with increased confidence scores. In Figure 11d, overall detection performance is improved, with all targets showing higher confidence scores. These results indicate that the integration of the EMA module leads to a significant enhancement in the overall detection performance of the model.
Table 4. Evaluation metrics for the introduced module. This table presents the precision, recall, and mAP metrics for different classes after incorporating the EMA module. The results are shown for both mAP@50 and mAP@50-90 thresholds. The "All" category represents the overall performance across all classes, while specific metrics are provided for the "Car" and "Pedestrian" classes.
Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-90 (%)
All | 93.9 | 85.9 | 93.8 | 69.1
Car | 94.3 | 91.5 | 96.7 | 77.9
Pedestrian | 93.4 | 80.3 | 90.8 | 60.4
After adopting the MPDIoU loss function, an initial increase in some loss function values was observed, along with significant drops and fluctuations in the early stages of training. However, as the number of training iterations increased, these loss functions eventually stabilized, showing a smooth curve trend.
Figure 13 illustrates the results after integrating the EMA module and replacing the loss function with MPDIoU. In Figure 13a, the confidence scores for all detected categories exceed 90%, demonstrating better performance compared to just adding the EMA module. Similarly, in Figure 13b, the confidence scores for all detected categories are above 90%, showing significant improvement. Figure 13c highlights the detection of categories that were not identified by the previous two models, with confidence scores exceeding 90% and some reaching up to 97%. In Figure 13d, a slight improvement is observed, with most categories having confidence scores above 88%, except for one category at 45%. These results indicate that replacing the loss function with MPDIoU further enhances detection performance, supporting the effectiveness of this improvement.
Table 5 shows the changes in evaluation metrics after adopting the MPDIoU loss function. The data reveal slight improvements in certain metrics. Specifically, the overall precision was 93.7%, with the precision for the Car category at 94.5% and for the Pedestrian category at 92.8%. In terms of recall, the overall rate was 86.2%, with the Car category at 91.3% and the Pedestrian category at 81.1%. For the mAP50 metric, the overall performance was 93.6%, with the Car category at 96.4% and the Pedestrian category at 90.8%. As for the mAP50-90 metric, the overall performance was 69.5%, with the Car category remaining at 77.9% and the Pedestrian category increasing to 61.2%. After adopting the MPDIoU loss function, the model experienced stable and slight improvements across various evaluation metrics.
Table 5. Evaluation metrics utilizing MPDIoU. This table presents the precision, recall, and mAP metrics for different classes using the modified MPDIoU loss function. The results are shown for both mAP@50 and mAP@50-90 thresholds. The "All" category represents the overall performance across all classes, while specific metrics are provided for the "Car" and "Pedestrian" classes.
Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-90 (%)
All | 93.7 | 86.2 | 93.6 | 69.5
Car | 94.5 | 91.3 | 96.4 | 77.9
Pedestrian | 92.8 | 81.1 | 90.8 | 61.2
Figure 14 presents a comparison of detection results. In Figure 14a, the left side shows an incorrect detection of a pedestrian where there is none, and distant vehicles are not successfully detected. Additionally, the pedestrian to the right of the red car in the middle is not detected, while a lamppost is mistakenly identified as a pedestrian, indicating mediocre detection performance. In contrast, Figure 14b corrects the errors observed in Figure 14a by not detecting any pedestrian on the left side. Moreover, it successfully detects the distant car that was missed in Figure 14a and correctly identifies the pedestrian to the right of the red car without misidentifying the lamppost as a pedestrian. The comparison between Figure 14a and Figure 14b demonstrates that the improved model significantly outperforms the baseline model, with notably enhanced detection performance.

Ablation Experiment
To verify the effectiveness of the improvements made, this study conducted a comparative analysis between the baseline model and each improved model. The comparison results are shown in Figure 15.
In Figure 15a, the precision of the baseline model is lower compared to the performance after introducing the EMA module and the MPDIoU loss. Notably, with the initial introduction of the EMA module, there is a significant increase in precision, which then stabilizes at a higher level, consistently outperforming the baseline model. Upon integrating MPDIoU, although the initial precision is slightly lower than that of the EMA module, it surpasses the previous two in the later stages of training. In Figure 15b, the introduction of the EMA module significantly improves the recall rate, creating a noticeable gap with the baseline model. After adopting MPDIoU, the model similarly surpasses the results with EMA in the later stages of training, showing a stable and gradual upward trend. For Figure 15c, the baseline model experiences fluctuations and a declining trend in the early stages of training, after which it gradually and steadily increases. After introducing EMA, the mAP50 metric shows a stable upward trend and a relatively smooth curve, with a significant initial difference from the baseline model, although this difference diminishes later. With MPDIoU, the performance exceeds that of the EMA result by the end of training. In the case of Figure 15d, the baseline model also undergoes fluctuations and declines before stabilizing and rising. After integrating EMA, the improvement in the mAP50-90 metric is more pronounced, showing the largest gap with the baseline model. When MPDIoU is adopted, the model similarly outperforms the EMA result in the later stages.
As shown in Table 6, after introducing the EMA module to the baseline model, there was an increase of 2.6% in precision, a 2.3% improvement in recall rate, a 1.7% rise in mAP50, and a significant 4.4% increase in mAP50-90. The introduction of the EMA module led to substantial enhancements across all key performance metrics, thereby confirming the significant impact of the improvements made. Upon adopting the MPDIoU loss function, the changes in the model's performance were minor. Specifically, there was a slight decrease in precision by 0.2%, while the recall rate increased by 0.3%. A minor decline of 0.2% was observed in the mAP50 metric. However, there was a 0.4% improvement in the mAP50-90 metric. These changes reflect the subtle balance and adjustment in the model's performance across different metrics after implementing the MPDIoU loss function.
Table 7 presents a comparison between the experimental model and YOLOv8. The results indicate that while both models achieve the same accuracy on the validation set, the experimental model surpasses YOLOv8 in recall, with an improvement of 2.7%. Additionally, the experimental model outperforms YOLOv8 by 1.7 percentage points in mAP50. The only metric where the experimental model falls slightly behind YOLOv8 is mAP50-90, with a decrease of one percentage point. Overall, the experimental model demonstrates superior performance compared to YOLOv8.
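The gains quoted for the ablation study can be re-derived from the overall ("All") metrics reported for the three model variants. The short check below is a sketch: the dictionary layout and function name are hypothetical, and the baseline mAP50-90 is taken as 64.7, the value implied by the per-class figures (74.2 and 55.2) and by the reported 4.4-point gain.

```python
# Overall ("All") metrics in percent, as reported for each model variant.
RESULTS = {
    "baseline": {"precision": 91.3, "recall": 83.6, "mAP50": 92.1, "mAP50-90": 64.7},
    "+EMA": {"precision": 93.9, "recall": 85.9, "mAP50": 93.8, "mAP50-90": 69.1},
    "+EMA+MPDIoU": {"precision": 93.7, "recall": 86.2, "mAP50": 93.6, "mAP50-90": 69.5},
}


def delta(before, after):
    """Difference in percentage points between two model variants."""
    return {
        metric: round(RESULTS[after][metric] - RESULTS[before][metric], 1)
        for metric in RESULTS[before]
    }


print(delta("baseline", "+EMA"))
# {'precision': 2.6, 'recall': 2.3, 'mAP50': 1.7, 'mAP50-90': 4.4}
print(delta("+EMA", "+EMA+MPDIoU"))
# {'precision': -0.2, 'recall': 0.3, 'mAP50': -0.2, 'mAP50-90': 0.4}
print(delta("baseline", "+EMA+MPDIoU"))
# {'precision': 2.4, 'recall': 2.6, 'mAP50': 1.5, 'mAP50-90': 4.8}
```

The last line reproduces the overall gains stated in the abstract, and the first two lines reproduce the per-step changes discussed for Table 6.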

Figure 1. YOLOv5 Model Architecture. The YOLOv5 framework comprises three core components: the backbone, the neck, and the head. The backbone is composed of multi-scale convolutional layers, batch normalization, and Mish activation functions for feature extraction. The C3 modules enhance feature processing through convolutional and bottleneck layers. The neck employs a feature pyramid structure and upsampling, using Concat operations to integrate features across scales, thereby improving multi-scale target detection efficiency. The head network includes three detection layers designed for classification, detection, or segmentation.

Figure 2. Enhanced YOLOv5 model architecture diagram. This diagram highlights the modifications made to the original YOLOv5 architecture shown in Figure 1. In the backbone, the standard downsampling convolution modules are replaced with DownC modules (DownC(1/4), DownC(1/8), DownC(1/16), and DownC(1/32)), and C3 modules are replaced with C3AFAM modules to enhance feature extraction with attention mechanisms. In the neck, EMA modules are added after each C3 module to further improve multi-scale feature integration. Additionally, an extra detection layer (Detect(1/4)) is included in the head to enhance the detection of small objects.

Figure 3. Adaptive Fusion Attention Module. The module is composed of two primary components: the ACM and the FSM. The ACM adjusts the feature channels adaptively, enhancing the most informative channels. The FSM then applies spatial attention to the fused feature maps, focusing on the most relevant spatial information.

Figure 4. ACM network architecture diagram. It begins with three convolution layers (1 × 1, 3 × 3, and 5 × 5), weighted by $W_1$, $W_2$, and $W_3$, respectively. The outputs are concatenated and passed through max pooling and average pooling layers. The pooled features are then fed into an MLP to generate channel-wise attention weights.

Figure 5. Schematic diagram of the FSM. The input feature map is processed through max pooling and average pooling layers, followed by a 1 × 3 × 3 convolutional layer. The outputs from these operations are concatenated and passed through additional convolutional layers to generate spatial attention maps.

Figure 6. Schematic diagram of the DownC. The input feature tensor, with dimensions C × H × W, is processed through a 1 × 1 convolutional layer followed by a max pooling layer. The resulting feature maps are then processed through a 3 × 3 convolutional layer with a stride of 2 and another 1 × 1 convolutional layer. The outputs of these layers are concatenated and summed, resulting in a feature map with reduced dimensions C/2 × H/2 × W/2.

Figure 7. Schematic diagram of the EMA. The input feature map is divided into g groups, each processed separately. For each group, the features are passed through an average pooling layer and concatenated. This is followed by sigmoid and softmax activation functions applied to the pooled features. The outputs are then normalized using group normalization. The resultant features are multiplied and summed, producing the final output feature map, which retains the input dimensions h × w × c after passing through a sigmoid activation function.
In Equation (9), $|C|$ represents the area of the minimum bounding rectangle that encloses $B_{gt}$ and $B_{prd}$. The coordinates $(x_c^{gt}, y_c^{gt})$ and $(x_c^{prd}, y_c^{prd})$ represent the central coordinates of the actual annotated bounding box and the predicted bounding box, respectively.

Figure 8a displays the loss function graphs for the training set, while Figure 8b shows the loss graphs for the validation set. Here, box_loss represents the loss function value for the model in locating targets, obj_loss indicates the loss function value for predicting the presence of targets, and cls_loss denotes the classification loss, measuring the model's performance in classifying targets. In Figure 8a, all loss functions decrease and stabilize over the course of training, indicating gradual improvement in model performance. Although there are some fluctuations in Figure 8b, the overall trend of the loss function is downward, suggesting good generalization of the model.
Figure 8. Baseline loss function graphs. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs.

Figure 9. Detection results of the baseline model. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant image. All instances of pedestrians in this image are successfully detected.

Figure 10. Loss function graphs with the EMA module incorporated. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs with the incorporation of the EMA module. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs with the incorporation of the EMA module.

Figure 11. Detection results with EMA integration. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant image. All instances of pedestrians in this image are successfully detected.

To address more complex driving environments, we used the MPDIoU loss function in the enhanced model. The resulting loss function graph is depicted in Figure 12.

Figure 12. Loss function graphs with the modified MPDIoU loss function. (a) Training set loss. This graph shows the training loss for box, object, and class predictions over 500 epochs with the modified MPDIoU loss function. (b) Validation set loss. This graph shows the validation loss for box, object, and class predictions over 500 epochs with the modified MPDIoU loss function.

Figure 13. Detection results with EMA integration. (a) Close-up view. Detection results of large targets in close-up scenarios. (b) Distant view. Detection results of small targets in distant scenarios. (c) Car-dominant image. This image predominantly contains cars, with only a few instances not being detected. (d) Pedestrian-dominant image. All instances of pedestrians in this image are successfully detected.

Figure 14. Comparison of detection results. (a) Baseline model detection results, illustrating the detection performance for each category using the baseline model. (b) Improved model detection results, showcasing the detection performance of the proposed improved model.

Figure 15. Comparison charts for the improved models. (a) Precision graph. This graph shows the precision values over 500 epochs for the baseline model, the model with EMA, and the model with EMA and MPDIoU. (b) Recall graph. This graph depicts the recall values over 500 epochs for the three models. (c) mAP50 graph. This graph illustrates the mAP at 50% IoU over 500 epochs. (d) mAP50-90 graph. This graph shows the mAP across IoU thresholds from 50% to 90% over 500 epochs.
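For readers less familiar with these metrics, the sketch below shows how precision, recall, and a threshold-averaged mAP are conventionally computed. It is an illustration under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, the 0.05 step between IoU thresholds is an assumed convention that the text does not state, and the counts and per-threshold AP values in the usage lines are made-up inputs.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted boxes that match a ground-truth box."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth boxes that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def iou_thresholds(start: float = 0.50, stop: float = 0.90, step: float = 0.05):
    """IoU thresholds from start to stop inclusive (step is an assumption)."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean_over_thresholds(ap_by_threshold: dict) -> float:
    """Average the per-threshold mAP values into a single score."""
    return sum(ap_by_threshold.values()) / len(ap_by_threshold)


# Hypothetical counts at a single IoU threshold.
print(precision(tp=8, fp=2), recall(tp=8, fn=8))

# Hypothetical mAP values, one per threshold, purely for illustration.
thresholds = iou_thresholds()
fake_ap = {t: 1.0 - t for t in thresholds}
print(thresholds, mean_over_thresholds(fake_ap))
```

mAP50 corresponds to the single entry at threshold 0.50, whereas mAP50-90 averages over the whole threshold list and therefore rewards tighter localization.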

Table 1. Experimental platform configuration. The GPU configuration includes two RTX 4090 cards, each with 24 GB of memory. The CPU is a 24 vCPU Intel(R) Xeon(R) Platinum 8352V running at 2.10 GHz. The operating system is Ubuntu 20.04, with CUDA version 11.3 and cuDNN version 8.2. PyTorch version 1.10.0+cu113 is used as the deep learning framework.

Table 2. Experimental parameter settings. The model was trained for 500 epochs. The dataset contains 2 categories. The batch size used during training is 8, and the input image size is 640 × 640 pixels.

Table 6. Comparative experiment table. This table compares the evaluation metrics of three models: baseline, EMA, and MPDIoU. The metrics include precision, recall, and mAP at 50% and 50-90% IoU thresholds. Arrows indicate the change in performance relative to the baseline model. The EMA model shows improvements in precision, recall, and mAP metrics, while the MPDIoU model shows a slight decrease in precision but improvements in recall and mAP.