Road Obstacle Detection Method Based on Improved YOLOv5
Abstract
1. Introduction
- The attention mechanism module is introduced to enhance feature extraction efficiency and expand the receptive field with the same convolutional kernel. This allows the model to capture subtle texture and edge information, leading to the acquisition of richer target features.
- An effective multi-scale feature fusion module is added to perform multi-scale feature extraction and fusion on the input feature map. This strengthens the connection between different levels of information and enhances the spatial texture details.
- The SPPF structure is modified and the C3SPPF module is proposed to improve the model’s ability to understand contextual information and to enhance its multi-scale adaptability, boosting the performance and generalization of the algorithm (an illustrative sketch of one possible SPPF/C3 combination follows this list).
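The sketch below illustrates the kind of module the third contribution refers to: it combines the standard YOLOv5 SPPF block with a C3-style two-branch structure. The `ConvBnSiLU` and `SPPF` classes follow the well-known YOLOv5 building blocks, while the `C3SPPF` class is a hypothetical reconstruction for illustration only; the module actually proposed in Section 3.2 may be wired differently.

```python
# Hypothetical PyTorch sketch: standard YOLOv5 ConvBnSiLU and SPPF blocks, plus one
# plausible C3-style wrapper around SPPF. Not the authors' implementation.
import torch
import torch.nn as nn


class ConvBnSiLU(nn.Module):
    """YOLOv5 basic block: convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class SPPF(nn.Module):
    """YOLOv5 SPPF: three cascaded max-pools whose outputs are concatenated,
    giving multi-scale receptive fields at low cost."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = ConvBnSiLU(c_in, c_mid, 1)
        self.cv2 = ConvBnSiLU(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))


class C3SPPF(nn.Module):
    """Hypothetical C3-style wrapper: one branch runs SPPF, the other is a 1x1
    shortcut; the two are concatenated and fused, as in a CSP/C3 block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.branch_sppf = nn.Sequential(ConvBnSiLU(c_in, c_mid, 1), SPPF(c_mid, c_mid))
        self.branch_skip = ConvBnSiLU(c_in, c_mid, 1)
        self.fuse = ConvBnSiLU(2 * c_mid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch_sppf(x), self.branch_skip(x)], dim=1))


if __name__ == "__main__":
    feat = torch.randn(1, 512, 20, 20)        # dummy deepest backbone feature map
    print(C3SPPF(512, 512)(feat).shape)       # torch.Size([1, 512, 20, 20])
```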
2. Related Work
- Input: In this initial stage, the input images undergo a series of augmentation and preprocessing steps (in YOLOv5, these include Mosaic data augmentation, adaptive anchor computation, and adaptive image scaling) that improve the robustness of the algorithm during training and ensure the data are well prepared for the network.
- Backbone: The Backbone extracts features from the input image. It adopts the CSPNet structure [36], which reduces the number of parameters while enhancing the algorithm’s generalization ability; CSPNet splits the feature map into two parts and introduces cross-stage skip connections, keeping computation efficient. The Backbone contains three critical modules: the ConvBnSiLU module, the C3 module, and the SPPF module. This design not only optimizes information flow but also strengthens feature expression, enabling the network to extract richer image features at a lower computational cost.
- Neck: The Neck integrates the Feature Pyramid Network (FPN) [37] and the Path Aggregation Network (PAN) [38]. In traditional CNNs, deeper features typically contain rich semantic information but exhibit poor spatial localization, whereas shallower features offer strong localization but lack semantic depth. By combining FPN and PAN, YOLOv5 enhances semantic information in the shallow layers while improving localization in the deeper layers. This fusion of multi-scale features significantly elevates the overall performance of the network.
- Head: The Detection Head makes the final predictions of an object’s category and location. YOLOv5 uses CIoU as its bounding-box loss function, which evaluates not only the overlap (IoU) of the predicted box but also its center-point distance and aspect-ratio consistency with respect to the ground truth. This allows better optimization of the learning and prediction processes, resulting in higher detection accuracy (a reference computation of the CIoU term follows this list).
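As noted in the Head item above, bounding-box regression is driven by the CIoU measure, which augments IoU with a normalized center-distance term and an aspect-ratio consistency term. The snippet below is a minimal Python sketch of that computation for two axis-aligned boxes; the function name `ciou` and the box format are illustrative, and real training code uses a batched, differentiable tensor version.

```python
# Minimal sketch of the CIoU measure for boxes in (x1, y1, x2, y2) format.
# The loss used during training is typically 1 - CIoU, computed on tensors.
import math


def ciou(box1, box2, eps=1e-7):
    b1x1, b1y1, b1x2, b1y2 = box1
    b2x1, b2y1, b2x2, b2y2 = box2

    # Overlap term: plain IoU
    iw = max(0.0, min(b1x2, b2x2) - max(b1x1, b2x1))
    ih = max(0.0, min(b1y2, b2y2) - max(b1y1, b2y1))
    inter = iw * ih
    w1, h1 = b1x2 - b1x1, b1y2 - b1y1
    w2, h2 = b2x2 - b2x1, b2y2 - b2y1
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Position term: squared center distance over the enclosing-box diagonal
    cw = max(b1x2, b2x2) - min(b1x1, b2x1)
    ch = max(b1y2, b2y2) - min(b1y1, b2y1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((b1x1 + b1x2 - b2x1 - b2x2) ** 2 + (b1y1 + b1y2 - b2y1 - b2y2) ** 2) / 4.0

    # Shape term: aspect-ratio consistency
    v = (4 / math.pi ** 2) * (math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


# Example: predicted box vs. ground-truth box
print(ciou((10, 10, 60, 50), (12, 14, 64, 56)))   # slightly below the plain IoU
```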
3. Methods
3.1. Effective Multi-Scale Feature Fusion Module
3.1.1. Efficient Multi-Scale Attention
3.1.2. Depthwise Separable Convolution
3.1.3. Effective Multi-Scale Feature Fusion Module
3.2. C3SPPF Module
3.3. Road Obstacle Detection Methods
4. Results
4.1. Datasets
4.2. Experimental Environment
4.3. Training Parameters and Results
4.4. Evaluation Indicators
4.5. Experimental Results
4.5.1. Ablation Experiments
4.5.2. Comparative Experiments Incorporating Different Attentions
- In terms of precision (P), the ECA and EMA attention mechanisms brought significant improvements of 3% and 3.1%, respectively, whereas SE, CA, and CBAM decreased precision by 1.1%, 0.6%, and 3.1%, respectively.
- In terms of recall (R) metrics, the ECA and EMA attention mechanisms exhibited decreases of 1.4% and 0.6%, respectively. In contrast, the SE, CA, and CBAM attention mechanisms demonstrated improvements in recall, with increases of 2.5%, 0.1%, and 3.5%, respectively.
- In terms of mAP, all attention mechanisms except CA, which decreased by 0.1%, showed improvements of 1%, 1.1%, 0.9%, and 0.8%, respectively, although the differences between these improvements were minimal (a sketch of how such an attention module can be inserted for comparison follows this list).
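For context on how such comparisons are typically set up: because channel-attention modules preserve tensor shape, they can be appended to a backbone stage and swapped against each other without changing the rest of the network. The sketch below shows a standard Squeeze-and-Excitation (SE) block as one example; the exact insertion point used in these experiments is not restated here, so the placement should be read as an assumption.

```python
# Standard Squeeze-and-Excitation (SE) attention block; the insertion point shown
# here is illustrative, not the exact configuration used in the experiments.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling followed by a two-layer
    bottleneck MLP that yields per-channel scaling weights in (0, 1)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                # reweight channels; spatial size unchanged


# Shape-preserving, so it can follow any backbone stage in an ablation study.
feat = torch.randn(2, 256, 40, 40)        # dummy feature map from a backbone stage
print(SEBlock(256)(feat).shape)           # torch.Size([2, 256, 40, 40])
```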
4.5.3. Comparative Experiments on the Performance of Different Algorithms
- The YOLOv5-EC3F algorithm shows significant improvement in both R and mAP compared with YOLOv5: the R-value increases by 7% and mAP improves by 3%. Although the precision (P) value decreases slightly by 0.1% and the FPS (frames per second) is somewhat reduced, the algorithm still meets real-time requirements. Furthermore, YOLOv5-EC3F outperforms YOLOv8 across all evaluation metrics.
- The YOLOX algorithm is 4.6% better than the YOLOv5-EC3F algorithm in precision (P) but is 2% and 2.3% lower in recall (R) and mAP, respectively. Additionally, the FPS value of YOLOX is only 34.6. YOLOv3 achieves the highest R-value and FPS, but it performs inadequately in precision and mAP. The SSD algorithm outperforms the YOLOv5-EC3F algorithm by 0.7% in precision, but it falls short in other important aspects. Overall, the YOLOv5-EC3F algorithm proposed in this paper is the most effective across the evaluated metrics.
- Comparison of the experimental results between Group A and Group C shows that Group A misclassifies a car as a stone, while Group C misclassifies a pothole as a stone. The YOLOv5-EC3F algorithm delineates the target more clearly, effectively resolving the problems of obstacle detection errors and unclear feature expression.
- Comparison of the experimental results between Group B and Group D shows that broken trees are missed in Group B, while potholes are missed in Group D. The YOLOv5-EC3F algorithm locates obstacle targets more accurately, addressing the problems of inaccurate localization and missed detection.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Versaci, M.; Laganà, F.; Manin, L.; Angiulli, G. Soft Computing and Eddy Currents to Estimate and Classify Delaminations in Biomedical Device CFRP Plates. J. Electr. Eng. 2025, 76, 72–79. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Uijlings, J.; Sande, K.V.D.; Gevers, T.; Smeulders, A. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Liu, Z.; Zhang, Y.; Wei, Z.; Hu, Z.; Wang, H.; Xing, L.; Yu, J.; Qian, J. A Deep Learning-Based Obstacle Detection System in the Radiotherapy Room Based on YOLOv5s. In Proceedings of the 2023 2nd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), Chengdu, China, 3–5 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 345–349. [Google Scholar]
- Xiang, L.; Jiang, W.B. Improvement of urban traveling road obstacle detection algorithm in YOLOv8. Electron. Meas. Technol. 2025, 48, 29–38. [Google Scholar] [CrossRef]
- Sami, A.A.; Sakib, S.; Deb, K.; Sarker, I.H. Improved YOLOv5-Based Real-Time Road Pavement Damage Detection in Road Infrastructure Management. Algorithms 2023, 16, 452. [Google Scholar] [CrossRef]
- Peng, Y.H.; Zheng, W.H.; Zhang, J.F. Deep learning based road obstacle detection method. Comput. Appl. 2020, 40, 2428–2433. [Google Scholar]
- Gao, J.-S.; Zhang, P.-N. Research on Obstacle Detection Algorithm Based on YOLOX. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 378–384. [Google Scholar]
- Ma, Y.; Xu, G. Research on Lightweight Obstacle Detection Model Based on Improved YOLOv5s. In Proceedings of the 2024 5th International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Guangzhou, China, 31 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 576–580. [Google Scholar]
- Feng, J.Y.; Zhang, H.; Zhang, T.L.; Peng, L.; Li, Y.J. SME-YOLO: A lightweight blind obstacle detection method. Comput. Technol. Dev. 2025, 35, 1–10. [Google Scholar] [CrossRef]
- Long, X.Y.; Nan, X.Y. A small and medium-sized obstacle detection method for intelligent driving scenarios. Sci. Technol. Eng. 2025, 25, 3778–3787. [Google Scholar]
- Liu, Y.X.; Guan, Z.S.; Shi, Y.J. Semantic segmentation-based obstacle detection for complex driving scenarios. Comput. Simul. 2023, 40, 167–171+231. [Google Scholar]
- Hui, G.; Meng, Z. Target Detection for Running Around Obstacles Based on Improved YOLOv5 Algorithm. In Proceedings of the 2024 IEEE 5th International Conference on Pattern Recognition and Machine Learning (PRML), Chongqing, China, 12–14 July 2024; pp. 296–301. [Google Scholar]
- Chen, M.Q.; Feng, S.J.; Zhang, Y.; Li, Q.F. Improved YOLOv5-based obstacle detection algorithm for UAVs. Sci. Technol. Eng. 2024, 24, 13627–13634. [Google Scholar]
- Tang, Y.J.; Miao, C.X.; Zhang, H.; Li, Y.F.; Ye, W. A low altitude UAV obstacle detection method based on position constraints and attention. J. Beijing Univ. Aeronaut. Astronaut. 2025, 51, 933–942. [Google Scholar] [CrossRef]
- Bai, J.Q.; Zhang, W.J. A lightweight UAV obstacle detection method based on YOLOv4 optimization. Electron. Meas. Technol. 2022, 45, 87–91. [Google Scholar] [CrossRef]
- Li, Y.; Ma, C.; Li, L.; Wang, R.; Liu, Z.; Sun, Z. Lightweight Tunnel Obstacle Detection Based on Improved YOLOv5. Sensors 2024, 24, 395. [Google Scholar] [CrossRef] [PubMed]
- Guan, L.; Jia, L.; Xie, Z.; Yin, C. A Lightweight Framework for Obstacle Detection in the Railway Image Based on Fast Region Proposal and Improved YOLO-Tiny Network. IEEE Trans. Instrum. Meas. 2022, 71, 1–16. [Google Scholar] [CrossRef]
- He, D.; Ren, R.; Li, K.; Zou, Z.; Ma, R.; Qin, Y.; Yang, W. Urban Rail Transit Obstacle Detection Based on Improved R-CNN. Measurement 2022, 196, 111277. [Google Scholar] [CrossRef]
- He, D.; Qiu, Y.; Miao, J.; Zou, Z.; Li, K.; Ren, C.; Shen, G. Improved Mask R-CNN for Obstacle Detection of Rail Transit. Measurement 2022, 190, 110728. [Google Scholar] [CrossRef]
- Liu, H.; Zheng, X.P.; Shen, Y.; Wang, S.Y.; Shen, Z.F.; Kai, J.R. Improved YOLOv5s-based target detection method for saplings and obstacles in nurseries. J. Agric. Eng. 2024, 40, 136–144. [Google Scholar]
- Chen, C.K.; Chen, J. Binocular vision-based obstacle detection and localization in an orchard. Agric. Mech. Res. 2023, 45, 196–201. [Google Scholar] [CrossRef]
- Xue, J.; Cheng, F.; Li, Y.; Song, Y.; Mao, T. Detection of Farmland Obstacles Based on an Improved YOLOv5s Algorithm by Using CIoU and Anchor Box Scale Clustering. Sensors 2022, 22, 1790. [Google Scholar] [CrossRef] [PubMed]
- Ruan, S.L.; Zhang, H.G.; Gu, Q.H.; Lu, C.W.; Liu, D.; Mao, J. Research on unmanned vehicle front obstacle detection in open-pit mines based on binocular vision. J. Coal 2024, 49, 1285–1294. [Google Scholar] [CrossRef]
- Zhang, J.; Wu, S.; Zhao, Q.; Liu, X.Q.; Huang, G. Development of a Lightweight Improved Algorithm for Obstacle Detection in Mine Electric Shovel Based on YOLOv5s. J. Wuhan Univ. Technol. 2025, 47, 59–66+81. [Google Scholar]
- Luo, W.; Wang, X.; Han, F.; Zhou, Z.; Cai, J.; Zeng, L.; Chen, H.; Chen, J.; Zhou, X. Research on LSTM-PPO Obstacle Avoidance Algorithm and Training Environment for Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2025, 13, 479. [Google Scholar] [CrossRef]
- Liu, D.; Zhang, J.; Jin, J.; Dai, Y.; Li, L. A New Approach of Obstacle Fusion Detection for Unmanned Surface Vehicle Using Dempster-Shafer Evidence Theory. Appl. Ocean Res. 2022, 119, 103016. [Google Scholar] [CrossRef]
- Zhou, C.; Wang, Y.; Wang, L.; He, H. Obstacle Avoidance Strategy for an Autonomous Surface Vessel Based on Modified Deep Deterministic Policy Gradient. Ocean Eng. 2022, 243, 110166. [Google Scholar] [CrossRef]
- Wu, X.; Su, C.; Yu, Z.; Zhao, S.; Lu, H. Automatic Emergency Obstacle Avoidance for Intelligent Vehicles Considering Driver-Environment Risk Evaluation. Comput. Electr. Eng. 2025, 123, 110187. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Mark Liao, H.-Y.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1571–1580. [Google Scholar]
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar] [CrossRef]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768. [Google Scholar]
- Ouyang, D.; He, S.; Zhan, J.; Guo, H.; Huang, Z.; Luo, M.L.; Zhang, G.L. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13708–13717. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1800–1807. [Google Scholar]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module; Springer: Cham, Switzerland, 2018. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11531–11539. [Google Scholar]
- Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
| Type of Obstacle | Label | Number of Labels | Number of Pictures |
|---|---|---|---|
| Stone | stone | 5281 | 1288 |
| Tree | tree | 3181 | 2040 |
| Pothole | pothole | 4397 | 1593 |
| Algorithm | Precision/% (Stone) | Precision/% (Tree) | Precision/% (Pothole) | Recall/% (Stone) | Recall/% (Tree) | Recall/% (Pothole) | mAP@0.5/% (Stone) | mAP@0.5/% (Tree) | mAP@0.5/% (Pothole) |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s | 80.1 | 90.2 | 74.7 | 84.8 | 62.9 | 65.4 | 88.4 | 74.7 | 74.0 |
| YOLOv5-EC3F | 78.4 | 92.3 | 74.0 | 89.0 | 72.0 | 73.1 | 88.6 | 79.3 | 78.2 |
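Averaging the per-class values in this table reproduces the overall changes reported in the comparative experiments (about a 0.1% drop in precision, a 7% gain in recall, and a 3% gain in mAP for YOLOv5-EC3F over YOLOv5s); the short script below performs that check.

```python
# Quick check: average the per-class values from the table above and compare the
# two models; the deltas match the overall changes reported in Section 4.5.3.
yolov5s     = {"P": [80.1, 90.2, 74.7], "R": [84.8, 62.9, 65.4], "mAP": [88.4, 74.7, 74.0]}
yolov5_ec3f = {"P": [78.4, 92.3, 74.0], "R": [89.0, 72.0, 73.1], "mAP": [88.6, 79.3, 78.2]}

for metric in ("P", "R", "mAP"):
    base = sum(yolov5s[metric]) / 3
    ours = sum(yolov5_ec3f[metric]) / 3
    print(f"{metric}: {base:.1f} -> {ours:.1f} (delta {ours - base:+.1f})")
# P: 81.7 -> 81.6 (delta -0.1)
# R: 71.0 -> 78.0 (delta +7.0)
# mAP: 79.0 -> 82.0 (delta +3.0)
```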