Research on Road Surface Distress Detection Algorithm in UAV Images with Multi-Scale Feature Fusion

Guo, Dudu; Cai, Wenxing; Shuai, Hongbo; Wei, Zhenxun; Chen, Guoliang

doi:10.3390/rs18101461

by Dudu Guo ¹,²,*, Wenxing Cai ³, Hongbo Shuai ³, Zhenxun Wei ⁴,⁵ and Guoliang Chen ⁴,⁵

¹ School of Traffic and Transportation Engineering, Xinjiang University, Urumqi 830017, China
² Xinjiang Key Laboratory of Green Construction and Smart Traffic Control of Transportation Infrastructure, Xinjiang University, Urumqi 830017, China
³ School of Intelligent Manufacturing Modern Industry, Xinjiang University, Urumqi 830017, China
⁴ Xinjiang Institute of Transportation Science and Technology Co., Ltd., Urumqi 830017, China
⁵ Key Laboratory of Transport Industry of Highway Engineering Technology in Arid Desert Areas, Urumqi 830017, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1461; https://doi.org/10.3390/rs18101461

Submission received: 25 March 2026 / Revised: 2 May 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

Highlights

What are the main findings?

An improved YOLOv8 algorithm is proposed for UAV-based road surface defect detection, incorporating four novel modules (FFDPN, IIDH, EIEM, and WaveletPool) to address insufficient feature fusion, detail loss, and small-target aliasing distortion, achieving a 12.2 percentage-point increase in mAP (83.8% → 96.0%).
With only 2.41 × 10⁶ parameters, the improved model outperforms mainstream detectors, including Faster R-CNN, YOLOv9, and YOLOv11n, in Precision (93.7%), Recall (89.6%), and mAP, while effectively eliminating duplicate detections and missed detections across four defect categories.

What are the implications of the main findings?

The proposed lightweight, high-accuracy model provides a practical solution for automated UAV-based highway pavement inspection, supporting the digital transformation of road maintenance by reducing reliance on manual labor and lowering inspection costs and safety risks.
The design principles of FFDPN and WaveletPool offer transferable methodological insights for multi-scale feature fusion and anti-aliasing downsampling in small-target detection tasks, with broad applicability to other UAV remote sensing object detection scenarios.

Abstract

Unmanned aerial vehicle (UAV) imagery offers a promising alternative to manual and vehicle-based inspection for highway pavement distress detection, but the high-angle perspective reduces the relative size and feature richness of small distresses and amplifies aliasing during downsampling, limiting the accuracy of existing detectors. To address these problems, this paper proposes an improved YOLOv8 algorithm with four coordinated modifications: (i) a Feature-Focusing Diffusion Pyramid Network (FFDPN) that replaces the conventional PAN to strengthen multi-scale feature fusion and preserve fine-grained details; (ii) an Information Interaction Detection Head (IIDH) that replaces the decoupled dual-branch head, sharing interaction features between the classification and regression branches via deformable convolution (DCNv2) to reduce parameters while improving task synergy; (iii) an Edge Information Extraction Module (EIEM) placed at the front of the backbone, which uses Sobel-based gradient response plus max-pooling to inject low-level edge priors; and (iv) a WaveletPool downsampling operator that decomposes features into LL/LH/HL/HH sub-bands to suppress aliasing of small-scale distresses. Experiments on 3408 UAV images of four distress categories (transverse, longitudinal, and alligator cracks and potholes) show that the improved model reaches 93.7% Precision, 89.6% Recall, and 96.0% mAP@0.50—a 12.2 percentage-point gain over YOLOv8n—while using only 2.41 × 10⁶ parameters and outperforming Faster R-CNN, DETR, YOLOv7-tiny, YOLOv9, YOLOv10n, YOLOv11n, and YOLO-World on the same benchmark. The model eliminates the duplicate and missed detections observed in baselines, at a moderate cost in FPS (30.3 vs. 57.1 for YOLOv8n).

Keywords: intelligent transportation; road surface distress detection; YOLOv8; UAV; small target detection; feature aggregation

1. Introduction

With the rapid development of the transportation industry, the digitalization of highway maintenance technology has become critical, and the digital transformation of highways has emerged as a key pathway to achieving high-quality development in highway transportation [1,2]. Currently, the primary methods for detecting pavement defects remain manual inspection and road inspection vehicle-based detection [3]. Manual inspection methods are characterized by low detection efficiency, lengthy detection cycles, high costs, and the need to close or partially close the site, which disrupts highway traffic operations and poses safety risks to staff, particularly in harsh environmental conditions. Road inspection vehicles, on the other hand, use digital onboard data collection devices to gather road surface information at a certain speed. However, these vehicles can only inspect the road surface conditions of the specific road they are on during a single inspection, resulting in low efficiency for tasks involving the full width of multi-lane highways. Additionally, they rely on human driving, making them less adaptable for road inspections in harsh environmental regions.

Driven by the diverse needs of production and daily life, unmanned aerial vehicle (UAV) remote sensing has been widely applied in waterway transportation [4], remote sensing surveying [5], power line inspections [6], and agricultural pest control [7]. Drones offer the advantages of flexibility and a high vantage point, enabling them to overlook the entire highway section and surrounding facilities, quickly obtain comprehensive road surface defect information, and access dangerous areas inaccessible to humans, thereby ensuring personnel safety. They represent an excellent remote sensing data collection method, and drone technology can effectively overcome challenges such as long maintenance routes, difficult data verification, and complex environmental conditions.

Meanwhile, advances in artificial intelligence, particularly deep learning, have made automated and intelligent road surface defect detection possible. However, the high vantage point of UAVs reduces the feature information available for road surface defects. Zhu [8] used YOLOv3 [9], YOLOv4 [10], and Faster R-CNN [11] to detect road surface defects in UAV imagery, but the detection accuracy of these models was not high. To address this issue, Yingchao [12] designed a multi-level attention mechanism that enables the model to integrate information from different positions along the channel, height, and width dimensions of the feature map, enhancing the model’s perception of global information and facilitating the localization and identification of road defects. This approach also allows the model to focus on the more important features of road defects, thereby improving detection accuracy. Other scholars have pointed out that insufficient feature fusion can reduce the accuracy of road defect detection in drone imagery; they adopted an adaptive multi-scale feature mapping fusion method to address insufficient feature fusion in YOLOv4, effectively improving detection accuracy and enabling rapid identification of the type and location of road defects [13]. Road defects in drone aerial images exhibit diverse morphologies, and small targets may be missed [14]. This is primarily due to feature loss during feature extraction [15] and target aliasing during downsampling [16], which leads to distorted feature representations of small targets. Although methods such as reducing the anchor box stride [17], multi-scale feature fusion [18,19,20], and adding small-target detection layers [21] can effectively extract small-target features and improve the detection accuracy of road defects, they cannot effectively address the feature distortion caused by downsampling in convolutional neural networks.

In summary, current drone-based road defect detection faces issues such as reduced feature information due to high-angle perspectives and low detection accuracy caused by insufficient model feature fusion, as well as missed detections due to feature loss and distortion in small road defects. This paper addresses these issues by designing a drone aerial image road defect detection algorithm based on the YOLOv8 [22] network framework. A Feature-Focusing Diffusion Pyramid Network (FFDPN) is designed to enhance feature fusion capabilities and reduce the loss of detailed information. In addition, an Information Interaction Detection Head (IIDH) is used to reduce the number of parameters and enhance feature interactivity. In the feature extraction part, an Edge Information Extraction Module (EIEM) is added to enhance the edge-detail extraction capability. To address the problem of missed detection, WaveletPool [23] is introduced to reduce the aliasing distortion of road surface defect features.

2. Materials and Methods

2.1. Theoretical Overview of the YOLOv8 Algorithm

The YOLOv8 algorithm is a detection algorithm launched by Ultralytics in 2023. It is a convolutional neural network detection algorithm based on the anchor-free mechanism [24]. YOLOv8 consists of three parts: backbone, neck, and head. The network structure is shown in Figure 1.

YOLOv8’s backbone uses the CSPDarkNet53 network for feature extraction. The Neck is a Path Aggregation Network (PAN), which includes two feature aggregation paths, bottom-up and top-down, that can fuse and enhance multi-scale features. The head classifies and regresses the three different scale feature maps output by the neck, converting the feature maps into categories and detection boxes. YOLOv8 is composed of multiple modules, including Convolution-BatchNorm-SiLU (CBS), Faster Cross Stage Partial Connection Module (C2f), Spatial Pyramid Pooling-Fast Module (SPPF), and Upsampling Module (Upsample). In YOLOv8, the 2D convolution (Conv2d), batch normalization layer (BatchNorm2d), and SiLU activation function serve to extract features, accelerate model convergence, and enhance model expressiveness, respectively. The Upsample module upscales feature maps through interpolation to enlarge their dimensions, facilitating feature fusion. YOLOv8 employs the Nearest Neighbor Interpolation Upsample (NNIU) method to double the size of feature maps.
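As a minimal illustration of the nearest-neighbor upsampling step described above, the sketch below doubles the height and width of a single-channel feature map in plain Python. The function name `nearest_upsample_2x` and the list-of-rows layout are assumptions made for illustration; this is not the Ultralytics implementation.

```python
def nearest_upsample_2x(fmap):
    """Double height and width by repeating each value in a 2x2 block.

    `fmap` is a single-channel feature map given as a list of rows
    (a hypothetical layout chosen for this sketch).
    """
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value along the width
        out.append(wide)
        out.append(list(wide))                     # repeat the widened row along the height
    return out


if __name__ == "__main__":
    small = [[1, 2],
             [3, 4]]
    for row in nearest_upsample_2x(small):
        print(row)
```

Because every output value is a copy of an input value, this operator introduces no new intensities, which is also why it can produce blocky edges.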

2.2. Improving the YOLOv8 Road Surface Defect Detection Algorithm

Due to the complex background noise in drone-captured images of road surface defects, the amount of usable feature information in the images is reduced. Although YOLOv8 has powerful data processing and feature extraction capabilities, it still suffers from insufficient fusion of road surface defect features, loss of feature information, and missed detections caused by target aliasing during downsampling. To address these issues, this paper proposes a road defect detection algorithm based on an improved YOLOv8 model. The structure of the improved YOLOv8 model is shown in Figure 2.

The improved YOLOv8 model improves the accuracy of road surface defect feature extraction and detection in drone images through four changes: (1) designing a Feature-Focusing Diffusion Pyramid Network, (2) designing an Information Interaction Detection Head, (3) adding an Edge Information Extraction Module, and (4) introducing WaveletPool.

2.2.1. Design of Feature-Focusing Diffusion Pyramid Network

In response to the insufficient fusion of road surface defect features and the loss of feature information in the YOLOv8 Path Aggregation Network (PAN), this paper optimizes the feature fusion path of the original model and designs a Feature-Focusing Diffusion Pyramid Network (FFDPN) for multi-scale feature fusion and transmission. The FFDPN diffuses the fused features in both upward and downward directions so that the rich contextual information and feature details of road surface defects reach every detection scale. Through the FocusFeature module, features at three different scales are concatenated along the channel dimension, enabling the network to learn the features of road surface defects at different scales. The structure of FFDPN is shown in Figure 3.

In Figure 3, FFDPN inputs three features of different scales, P3, P4, and P5, into the FocusFeature module for feature aggregation and spreads the aggregated distress features upward and downward to different detection scales, then transmits them to another FocusFeature module for further spreading. Its structure can be regarded as two structurally identical feature aggregation and spreading parts connected in series. Compared to the original bottom-up and top-down feature fusion networks, FFDPN has a shorter information transmission path, retaining low-level detail information while utilizing high-level semantic information of the damage, thereby avoiding information loss. It also enables each detection scale to fully utilize the contextual information of road damage, alleviating the issue of low-level features being overwhelmed by high-level semantic features. This enhances the model’s adaptability to complex scenes and its ability to identify small-scale road defects.

The core of FFDPN is the FocusFeature module, which takes three feature maps of different scales as input. The height and width of the three feature maps are multiples of 2 and are input in list form. The structure of the road surface defect FocusFeature module is shown in Figure 4.

In Figure 4, X1, X2, and X3 represent three feature maps of different scales, each of which is a four-dimensional tensor. Here, B denotes the batch size during network training, C denotes the number of channels in the feature map, and the constants a, c, and e preceding C are all greater than 0. H and W denote the height and width of the feature map, respectively. For the smaller-scale X1, bilinear interpolation upsampling (BIU) and a CBS convolution block are used to scale its height and width to match those of X2. X3 is downsampled using the ADown module, reducing its height and width to match those of X2. These operations ensure that the spatial sizes are the same when performing channel concatenation. After scale conversion, the features are concatenated along the channel dimension. The concatenated feature map X is fed into the StackFusion module, which applies convolution kernels of several different sizes, stacks the results, and sums them element by element. The result of StackFusion is passed through a 1 × 1 CBS convolution block and added to the concatenated feature map X through a residual connection to obtain the final FocusFeature feature map. The number of channels, height, and width of the FocusFeature feature map depend on X2.

The principle of bilinear interpolation is shown in Equation (1).

$$P = q_{lt} \cdot (1 - \Delta h_{1}) \cdot (1 - \Delta w_{1}) + q_{rb} \cdot \Delta h_{2} \cdot \Delta w_{2} + q_{lb} \cdot \Delta h_{2} \cdot (1 - \Delta w_{1}) + q_{rt} \cdot (1 - \Delta h_{1}) \cdot \Delta w_{2} \tag{1}$$

In Equation (1), $P$ represents the coordinates of the target point for bilinear interpolation, including the height and width information of the feature map where the target point is located. $q_{lt}$, $q_{rb}$, $q_{lb}$, and $q_{rt}$ represent the integer coordinates of the upper-left, lower-right, lower-left, and upper-right corners surrounding the target point, respectively. $\Delta h_{1}$ and $\Delta w_{1}$ represent the differences between the target point coordinate $P$ and the height and width of the upper-left corner $q_{lt}$, respectively, while $\Delta h_{2}$ and $\Delta w_{2}$ represent the differences between the target point coordinate and the lower-right corner coordinate $q_{rb}$. The magnification factor for bilinear interpolation in this paper is set to 2×. Bilinear interpolation calculates the target point’s value using weighted averaging, resulting in a smoother outcome and avoiding the jagged edges caused by simple rounding, such as those produced by nearest-neighbor interpolation.
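To make the weighted-averaging idea concrete, the following sketch implements bilinear sampling and 2× upsampling in plain Python. It uses the common textbook form in which a single pair of fractional offsets (`dh`, `dw`), measured from the upper-left neighbor, weights all four corners; how these map onto the paper's separate offsets toward the lower-right corner is an assumption. The function names, the border clamping, and the half-pixel coordinate mapping are likewise illustrative choices, not details taken from the paper.

```python
import math


def bilinear_sample(fmap, y, x):
    """Interpolate a single-channel map (list of rows) at fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    # Upper-left neighbor, clamped so border queries stay inside the map.
    y0 = min(max(math.floor(y), 0), h - 1)
    x0 = min(max(math.floor(x), 0), w - 1)
    y1 = min(y0 + 1, h - 1)
    x1 = min(x0 + 1, w - 1)
    # Fractional offsets from the upper-left neighbor, clamped to [0, 1].
    dh = min(max(y - y0, 0.0), 1.0)
    dw = min(max(x - x0, 0.0), 1.0)
    q_lt, q_rt = fmap[y0][x0], fmap[y0][x1]
    q_lb, q_rb = fmap[y1][x0], fmap[y1][x1]
    return (q_lt * (1 - dh) * (1 - dw) + q_rt * (1 - dh) * dw
            + q_lb * dh * (1 - dw) + q_rb * dh * dw)


def bilinear_upsample_2x(fmap):
    """Upsample by 2 using a half-pixel source mapping (an assumed convention)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(2 * h):
        src_y = (i + 0.5) / 2 - 0.5
        out.append([bilinear_sample(fmap, src_y, (j + 0.5) / 2 - 0.5)
                    for j in range(2 * w)])
    return out


if __name__ == "__main__":
    print(bilinear_sample([[0, 10], [20, 30]], 0.5, 0.5))
    print(bilinear_upsample_2x([[0, 10]]))
```

The four weights always sum to 1, so the interpolated value never leaves the range spanned by the four neighbors.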

The structure of the StackFusion module is shown in the lower right corner of Figure 4. The StackFusion module is a multi-scale pyramid composed of four convolution kernels of different sizes, given by the list k = [5, 7, 9, 11] in the figure. Each convolution outputs 3cC/2 channels. Before stacking, each feature map is expanded to a five-dimensional tensor [1, B, 3cC/2, H, W]; these tensors are then concatenated along the new dimension to form the tensor [5, B, 3cC/2, H, W]. Finally, the tensor is summed element-wise along the new dimension to fuse feature maps with different receptive fields, enabling the model to learn rich contextual information while minimizing the loss of low-level detail.

To make the feature-fusion and small-target handling explicit, the FocusFeature operation can be formalized as follows. Let $X_{1} \in \mathbb{R}^{B \times aC \times \frac{H}{2} \times \frac{W}{2}}$, $X_{2} \in \mathbb{R}^{B \times cC \times H \times W}$, and $X_{3} \in \mathbb{R}^{B \times eC \times 2H \times 2W}$ denote the three input scales, with $U(\cdot)$ the bilinear-interpolation upsampler of factor 2 defined in Equation (1), and $D(\cdot)$ the ADown downsampler. The scale-aligned features are $\tilde{X}_{1} = \Phi_{up}(U(X_{1}))$, $\tilde{X}_{2} = X_{2}$, and $\tilde{X}_{3} = D(X_{3})$, where $\Phi_{up}$ is the CBS block applied after upsampling. They are concatenated along the channel dimension as $X = [\tilde{X}_{1}; \tilde{X}_{2}; \tilde{X}_{3}] \in \mathbb{R}^{B \times (a + c + e)C \times H \times W}$. StackFusion then applies a pyramid of $K = 4$ depthwise convolutions with kernel sizes $k \in \{5, 7, 9, 11\}$ and sums the responses element-wise, $S(X) = \sum_{k \in \{5, 7, 9, 11\}} f_{k}(X)$, where $f_{k}$ denotes the depthwise convolution with kernel size $k$. A 1 × 1 CBS reduces the channel count back to $cC$, and the output of one FocusFeature unit is $Y = \psi_{1 \times 1}(S(X)) + X$, where the residual addition $X$ is broadcast over the reduced channel dimension. FFDPN is the series composition $Y_{out} = F \circ F(X_{1}, X_{2}, X_{3})$, where $F$ denotes one FocusFeature unit; the diffused outputs at the three detection scales are obtained by redistributing $Y_{out}$ through symmetric up- and downsampling.

Two consequences of this construction are relevant to small-target detection. First, the residual path $X \to Y$ guarantees that every channel of the concatenated feature $X$, including the high-resolution detail channels contributed by $X_{3}$, is preserved at the output, so a P3 feature whose discriminative cue spans only ~10 pixels cannot be erased by deeper-layer averaging. Second, summing $K = 4$ receptive fields with $k \in \{5, 7, 9, 11\}$ approximates a multi-scale aggregation kernel whose effective support covers between 25 and 121 pixels, which brackets the ground-pixel size of the four distress categories at the UAV altitudes used in this paper, so the same fusion unit can simultaneously match a thin transverse crack and a wider pothole without retuning.
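The shape bookkeeping in the formalization above can be checked with a few lines of Python. The sketch below only tracks tensor shapes and kernel supports, it does not run any convolution; the function name `focus_feature_shapes` and the default values a = c = e = 1 are assumptions for illustration, and the channel counts follow the (a + c + e)C expression of the formalization rather than any specific implementation.

```python
def focus_feature_shapes(B, C, H, W, a=1, c=1, e=1, kernels=(5, 7, 9, 11)):
    """Track shapes through scale alignment and channel concatenation.

    Returns the concatenated shape and the k*k support of each kernel.
    H and W must be even so that the half-resolution input is well defined.
    """
    if H % 2 or W % 2:
        raise ValueError("H and W must be even")
    x1 = (B, a * C, H // 2, W // 2)   # smaller-scale input, to be upsampled by 2
    x2 = (B, c * C, H, W)             # reference scale
    x3 = (B, e * C, 2 * H, 2 * W)     # larger-scale input, to be downsampled by 2
    x1_aligned = (x1[0], x1[1], x1[2] * 2, x1[3] * 2)
    x3_aligned = (x3[0], x3[1], x3[2] // 2, x3[3] // 2)
    # All three maps must share H x W before channel concatenation.
    assert x1_aligned[2:] == x2[2:] == x3_aligned[2:]
    concat = (B, x1_aligned[1] + x2[1] + x3_aligned[1], H, W)
    support = {k: k * k for k in kernels}  # pixels covered by one k x k kernel
    return concat, support


if __name__ == "__main__":
    shape, support = focus_feature_shapes(B=2, C=64, H=40, W=40)
    print(shape)
    print(support)
```

The smallest and largest supports, 5 × 5 = 25 and 11 × 11 = 121 pixels, are the bounds quoted in the text.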

Operational stability of the FFDPN is ensured by four design choices that together avoid the classical pathologies of deep multi-branch fusion networks. (i) The residual addition $Y = \psi_{1 \times 1}(S(X)) + X$ provides an identity shortcut, so the gradient with respect to the input $X$ has a unit-Jacobian component, $\frac{\partial Y}{\partial X} = I + \frac{\partial (\psi_{1 \times 1} \circ S)}{\partial X}$, which prevents the vanishing-gradient behavior that long, fusion-only chains otherwise exhibit. (ii) Each CBS block contains a BatchNorm layer, which normalizes the activation statistics across the spatial dimensions and across the $K = 4$ stacked kernel responses so that the element-wise sum in StackFusion does not explode in magnitude when several large-kernel branches activate together. (iii) The bilinear interpolation in Equation (1) is differentiable and Lipschitz-continuous with constant 1, so upsampling does not amplify input perturbations; combined with the ADown downsampler, which is a learned 1 × 1 stride-2 projection rather than max-pooling, the scale-alignment stage preserves bounded sensitivity to small input shifts. (iv) The two FocusFeature units are connected in series rather than in a fully cross-coupled graph, which keeps the depth-to-width ratio bounded and prevents the gradient noise that recurrent or densely coupled fusion topologies tend to accumulate. Empirically, in all training runs reported in Section 3, the validation loss decreased monotonically after the first ten epochs, and no NaN or divergent loss events were observed across the 300-epoch schedule. The proposed model also converges faster and to a lower plateau than the baselines, which we take as additional evidence that the FFDPN-augmented model is numerically stable and well-behaved during optimization on this task.

2.2.2. Design of Information Interaction Detection Head

The YOLOv8 Head separates the classification and localization regression tasks, allowing the model to focus on classification and localization tasks separately, thereby improving detection efficiency. However, this results in poor interaction between the two task branches and insufficient fusion of road surface defect features and context information. To address this issue, this paper designs an Information Interaction Detection Head (IIDH) to enhance the interaction between the two task branches of classification and regression. The specific structure is shown in Figure 5.

The IIDH first performs residual connection and splicing on the features output by the Convolution-GroupNorm-SiLU (CGS) to obtain interaction features and then inputs the interaction features into the Categorize Task Module and Regression Task Module, respectively. The classification branch multiplies the category probability distribution and confidence generated by the interaction features after convolution and activation function processing by the output results of the Categorize Task Module to achieve dynamic feature selection, further enhancing the model’s focus on key target areas. The localization branch is the Regression Task Module, which uses a deformable convolutional network [25] (DCNv2) to perform adaptive sampling and weighting of features, generate bounding box parameters, improve the modeling accuracy of road surface defects, and enhance the synergy between the classification and regression tasks. At the same time, interactive features are used to reduce the number of model parameters. The task module structure is shown in Figure 6.

The structure of the Categorize Task Module and Regression Task Module in the IIDH is the same, as shown in Figure 6, including an AdaptiveAvgPool2d layer, two Conv2d layers, ReLU and Sigmoid activation functions, a GroupNorm2d layer, and a SiLU activation function. These components work together to process and transform the input features.

In Figure 6, the interaction features in the classification branch are processed by convolution followed by ReLU and Sigmoid activation functions to obtain the modulation weights for the features of the Categorize Task Module. Multiplying by these weights dynamically adjusts the channel weights of the feature map, achieving dynamic modulation of road surface defect features and better adapting to the identification of different types of defects. The interaction features in the localization branch are primarily used to model road surface defects in the target area more precisely. The interaction features obtained from the CGS convolution block are used to generate the offset and mask for the deformable convolution network, with a channel ratio of 2:1. The offset and mask are input into DCNv2 together with the results of the Regression Task Module to perform adaptive sampling and weighted calculation, thereby achieving feature enhancement and adaptive spatial transformation. This can effectively improve the attention and accuracy of the Information Interaction Detection Head in classifying and locating road surface defects.
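The 2:1 channel ratio follows from how DCNv2 parameterizes each kernel sampling point: two offset values (a vertical and a horizontal displacement) and one modulation scalar squashed by a sigmoid. The sketch below splits a per-pixel channel vector accordingly. The 3 × 3 kernel size and the function name `split_offset_mask` are assumptions for illustration; the paper does not state the kernel size used in the IIDH.

```python
import math


def split_offset_mask(channels, kernel_size=3):
    """Split a per-pixel channel vector into DCNv2 offsets and mask.

    For a k x k kernel there are k*k sampling points, each needing two
    offset values and one mask value, hence 3*k*k channels in total.
    """
    taps = kernel_size * kernel_size
    if len(channels) != 3 * taps:
        raise ValueError(f"expected {3 * taps} channels, got {len(channels)}")
    offsets = list(channels[:2 * taps])                    # 2*k*k raw displacements
    mask = [1.0 / (1.0 + math.exp(-v))                     # k*k weights in (0, 1)
            for v in channels[2 * taps:]]
    return offsets, mask


if __name__ == "__main__":
    raw = [0.0] * 27  # 18 offset channels + 9 mask channels for a 3x3 kernel
    offsets, mask = split_offset_mask(raw)
    print(len(offsets), len(mask), mask[0])
```

With all-zero inputs the offsets leave the sampling grid unchanged and every mask weight is 0.5, so the layer starts from a uniformly scaled ordinary convolution.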

2.2.3. Design of Edge Information Extraction Module

In order to make full use of the feature information of the image, an Edge Information Extraction Module (EIEM) was added to the front end of the YOLOv8 backbone to obtain edge information of road surface defects and gradient change information of the image, enrich the details of the underlying features, and improve the detection accuracy of the road surface defect detection box. The structure of the EIEM is shown in Figure 7.

The EIEM feeds the features from the CBS convolution block into SobelConv, which calculates the gradient at each pixel to obtain the edge information of road surface defects. A MaxPool2d branch retains salient features, the results are concatenated, and a final CBS convolution block adjusts the number of channels. SobelConv includes SobelConv_Y in the height direction and SobelConv_X in the width direction. The output of SobelConv is the sum of the outputs of these two convolutions.
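The Sobel step can be sketched in plain Python as two fixed 3 × 3 filters whose responses are added, as the text describes. The standard Sobel coefficients, the zero padding, and the names `conv3x3` and `sobel_response` are assumptions for illustration; the module in the paper operates on multi-channel feature maps and may differ in these details.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]    # responds to intensity changes along the width
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]     # responds to intensity changes along the height


def conv3x3(img, kernel):
    """3x3 cross-correlation with zero padding, output size equals input size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # outside pixels count as zero
                        acc += img[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = acc
    return out


def sobel_response(img):
    """Sum of the width-direction and height-direction Sobel responses."""
    gx = conv3x3(img, SOBEL_X)
    gy = conv3x3(img, SOBEL_Y)
    return [[gx[i][j] + gy[i][j] for j in range(len(img[0]))]
            for i in range(len(img))]


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 1, 1] for _ in range(4)]
    for row in sobel_response(step):
        print(row)
```

On the step image the interior response is nonzero only through the width-direction filter, which is the kind of low-level edge prior the module is meant to inject.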

2.2.4. WaveletPool

In order to deal with the target aliasing phenomenon that occurs during network downsampling, WaveletPool is used to replace the CBS convolution block in the YOLOv8 backbone for downsampling. The structure of WaveletPool is shown in Figure 8.

WaveletPool consists of four filter operators: low-low (LL), low-high (LH), high-low (HL), and high-high (HH). Each operator has C channels. The LL operator retains the low-frequency information of the image, that is, its main trend and overall structure. The LH and HL operators retain the gradient information along the image height and width, respectively. The HH operator retains the contour and shape edge information of objects in the image. Wavelet pooling uses these operators, in combination with the frequency characteristics of the image, to reduce the feature dimension, better preserve the structural information of the features, reduce the risk of overfitting, and alleviate the information loss caused by target aliasing during downsampling.

More precisely, given an input feature map $F \in \mathbb{R}^{B \times C \times H \times W}$ and the Haar low- and high-pass kernels $h = \frac{1}{\sqrt{2}}[1, 1]$ and $g = \frac{1}{\sqrt{2}}[1, -1]$, WaveletPool produces four sub-bands by separable 2D filtering followed by stride-2 sampling: $F_{LL} = (F * h_{x}) * h_{y}$, $F_{LH} = (F * h_{x}) * g_{y}$, $F_{HL} = (F * g_{x}) * h_{y}$, and $F_{HH} = (F * g_{x}) * g_{y}$, each in $\mathbb{R}^{B \times C \times H/2 \times W/2}$. Because the four sub-bands form an orthonormal basis of the input subspace at half resolution, the energy satisfies $\|F\|^{2} = \|F_{LL}\|^{2} + \|F_{LH}\|^{2} + \|F_{HL}\|^{2} + \|F_{HH}\|^{2}$, so the high-frequency energy that a strided convolution would discard, and that carries the few-pixel boundaries of small distresses, is retained explicitly in $F_{LH}$, $F_{HL}$, and $F_{HH}$. From the Nyquist–Shannon viewpoint, the low-pass branch $h$ applied before stride-2 sampling guarantees that the LL sub-band itself does not alias, while the three high-frequency sub-bands are preserved rather than folded back onto LL. This is the formal anti-aliasing property that motivates substituting WaveletPool for the stride-2 CBS block at the small-target detection scales of the backbone, and it is the mechanism by which the model retains the discriminative cues of distresses occupying only tens of pixels in UAV imagery.
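The sub-band decomposition and its energy identity can be verified numerically. The following sketch applies the one-level 2D Haar transform to non-overlapping 2 × 2 blocks of a single-channel map, which is equivalent to separable Haar filtering followed by stride-2 sampling. The function names and the sign convention of the high-pass outputs are assumptions for illustration, and the sketch makes no claim about how the sub-bands are recombined inside the network.

```python
def haar_wavelet_pool(fmap):
    """One-level 2D Haar decomposition of a single-channel map (list of rows).

    Returns (LL, LH, HL, HH), each at half the input height and width.
    LH is low-pass along the width and high-pass along the height;
    HL is high-pass along the width and low-pass along the height.
    """
    h, w = len(fmap), len(fmap[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = fmap[i][j], fmap[i][j + 1]          # top row of the block
            c, d = fmap[i + 1][j], fmap[i + 1][j + 1]  # bottom row of the block
            r_ll.append((a + b + c + d) / 2)  # (1/sqrt2)^2 = 1/2 normalization
            r_lh.append((a + b - c - d) / 2)
            r_hl.append((a - b + c - d) / 2)
            r_hh.append((a - b - c + d) / 2)
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def energy(band):
    """Sum of squared values of a 2D map."""
    return sum(v * v for row in band for v in row)


if __name__ == "__main__":
    f = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
    bands = haar_wavelet_pool(f)
    print(energy(f), sum(energy(b) for b in bands))
```

The two printed energies agree, which illustrates that the four half-resolution sub-bands together lose none of the input energy, whereas keeping only LL would drop the detail terms.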

3. Experiments and Analysis of Results

3.1. Experimental Setup and Evaluation Criteria

After preprocessing the collected data, 3408 aerial images captured by drones were selected for model experiment verification. The dataset’s categories included transverse cracks, longitudinal cracks, alligator cracks, and potholes, which were divided into training, validation, and test sets in an 8:1:1 ratio. Part of the dataset is shown in Figure 9.
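An 8:1:1 split of this kind can be reproduced with a seeded shuffle, as in the sketch below. The paper does not report the exact number of images per subset or the random seed, so the rounding rule (integer division, with the remainder assigned to the test set), the seed, and the function name `split_dataset` are all assumptions for illustration.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # local RNG, global state untouched
    total = sum(ratios)
    n = len(items)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names standing in for the 3408 UAV images.
    names = [f"img_{i:04d}.jpg" for i in range(3408)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```

Under this particular rounding rule, 3408 images yield 2726, 340, and 342 items, which may differ slightly from the authors' actual subsets.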

The imagery was collected along national and provincial highway sections in Xinjiang Uygur Autonomous Region, China, using a DJI (DJI, Shenzhen, China) quad-rotor UAV equipped with a 1/2.3-inch CMOS imaging sensor. Flights were conducted at altitudes between 10 m and 30 m above ground level, with camera tilt angles ranging from nadir (90°) to approximately 60° off-nadir, so that the dataset spans multiple UAV altitudes and imaging perspectives commonly encountered in practical inspection. Acquisitions were carried out under clear daylight conditions with the solar elevation above 30° to limit strong directional shadows; the set therefore does not currently include adverse-weather scenes (rain, fog, low light, or overcast). Native image resolution is 1920 × 1080 pixels. To reduce label ambiguity between genuine distresses and visually similar non-distress artifacts such as oil stains, spilled soil, tire marks, and patched asphalt, each image was cross-checked by two annotators, and ambiguous blobs were reverified against higher-magnification inspection photographs; only regions with a coherent crack morphology (linear, branching, or network pattern) or a clearly concave pothole boundary were retained as positive samples.

(1) Experimental environment

The experimental environment used in this paper is shown in Table 1.

(2) Experimental parameter configuration

During the training phase, in order to shorten the training time of the model, this paper adopted a transfer learning strategy. All experiments adopted uniform parameter configurations, with parameter settings as shown in Table 2.

(3) Experimental evaluation indicators

This paper uses Precision (P), Recall (R), Average Precision, mean Average Precision (mAP), model parameter count (Params), model complexity (GFLOPs), and Frames Per Second (FPS) as evaluation metrics to assess the overall performance of the model.
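For reference, the accuracy metrics listed above can be computed as in the sketch below, using their standard definitions. Matching detections to ground truth (for example at an IoU threshold of 0.50) is assumed to have been done beforehand, the all-point interpolation used for AP is one common convention and not necessarily the one used by the authors' toolkit, and the function names and example numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, n_gt):
    """Area under the precision-recall envelope for one class.

    `scored_hits` is a list of (confidence, is_true_positive) pairs and
    `n_gt` is the number of ground-truth objects of that class.
    """
    if n_gt <= 0:
        raise ValueError("n_gt must be positive")
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, in rank order
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_gt, tp / (tp + fp)))
    # Replace each precision by the maximum precision at any higher recall.
    best = 0.0
    envelope = []
    for recall, prec in reversed(points):
        best = max(best, prec)
        envelope.append((recall, best))
    envelope.reverse()
    ap, prev_recall = 0.0, 0.0
    for recall, prec in envelope:
        ap += (recall - prev_recall) * prec
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    print(precision_recall(tp=9, fp=1, fn=3))
    print(average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2))
```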

3.2. Ablation Experiment

To validate the effect of the proposed improvements on road surface defect detection, eight ablation experiments were designed using the control variable method. Starting from YOLOv8n, we replaced the YOLOv8 path aggregation network with the FFDPN, replaced the YOLOv8 detection head with the IIDH, replaced the CBS convolution blocks responsible for downsampling in the YOLOv8 backbone with WaveletPool, and added the EIEM. The improved modules were added sequentially and then individually, and the results were compared with the original YOLOv8n, as shown in Table 3.

As shown in the experimental data in Table 3, the model in this paper achieves significant improvements in P, R, and mAP, with mAP improving by 12.2 percentage points. In group ②, replacing the PAN with the FFDPN improved the overall detection accuracy, which shows that the improved feature fusion strategy is effective, but the detection speed was reduced. In group ③, the use of the IIDH resulted in superior speed and detection performance compared to the original model. In group ④, introducing WaveletPool improved the detection accuracy without significantly increasing the number of model parameters, indicating that this module can effectively retain key feature information during downsampling. In group ⑤, the EIEM significantly improved the detection accuracy of the original model. Overall, the improved model has fewer parameters than the original model, but its computation speed is lower.

3.3. Comparison Experiment

To evaluate the effectiveness of the improved algorithm, this paper compares it with other mainstream object detection models, including the two-stage Faster R-CNN, single-stage YOLO series algorithms (YOLOv7-tiny through YOLO11n), the YOLO-World [26] algorithm, and the Transformer-based DETR [27] object detection model. The experimental results are shown in Table 4.

As shown in Table 4, the proposed model achieves the highest P (93.7%) and mAP (96.0%) of all compared models at the cost of some detection speed; its R (89.6%) is second only to that of YOLOv9 (90.8%). Among the baselines, the earlier Faster R-CNN, YOLOv7-tiny [28], YOLOv10n [29] developed by the Tsinghua University team, and the Transformer-based DETR perform poorly on road surface defects, with mAP values between 67.8% and 78.7%. YOLOv9 [30] is the most accurate baseline (92.4% mAP), but it still falls short of the proposed model, and its detection speed (5.8 FPS) is the slowest among all algorithms. YOLO11n [31], YOLO-World, and YOLOv8n are faster than the proposed model, but their mAP (83.5–87.0%) is 9.0–12.5 percentage points lower, making them less suitable for effective detection of road surface defects. In summary, the proposed model substantially improves the accuracy of road surface defect detection while retaining a detection speed of 30.3 FPS.

The loss curve of the comparison experiment is shown in Figure 10.

It can be seen that the improved model in this paper has lower loss and faster convergence speed compared to other mainstream object detection models.

3.4. Detection Results

To visually demonstrate the detection performance of the algorithm before and after improvement, this paper conducted detection tests on both the UAV-PDD2023 [32] dataset and a self-built dataset. The visualized results after detection are shown in Figure 11.

In alligator crack detection (Figure 11a), several mainstream object detection models produced duplicate and missed detections, and in Figure 11b some models (Faster R-CNN, YOLOv7-tiny) misidentified defects. The improved algorithm avoids these errors and achieves the highest detection accuracy and more precise localization. In transverse crack detection, as shown in Figure 11a, some models (Faster R-CNN, YOLOv7-tiny, YOLOv10n) missed detections, and the improved YOLOv8 achieved an 11% higher detection accuracy than the original YOLOv8n model. In pothole detection (Figure 11b,c), YOLOv10n missed detections, while some models (Faster R-CNN, YOLOv7-tiny, YOLOv9, YOLO-World) produced duplicate detections; the improved YOLOv8 model avoids both while maintaining high detection accuracy. In longitudinal crack detection (Figure 11c), some models (Faster R-CNN, YOLOv7-tiny, DETR, and YOLOv8n) produced duplicate detections, which the improved YOLOv8 model again avoids while maintaining high detection accuracy. The proposed model correctly detects the road defects in all three panels of Figure 11. Overall, the improved YOLOv8 model demonstrates better detection performance than the other models.

4. Discussion

4.1. Why the Proposed Modules Improve Detection

The ablation results in Table 3 can be interpreted mechanistically rather than as isolated engineering gains. FFDPN improves mAP by +3.2 points on top of YOLOv8n because the series-connected twin FocusFeature units shorten the information path between P3, P4, and P5 relative to the bottom-up/top-down PAN: every detection scale sees fused low-level texture and high-level semantics in a single hop, so small distresses whose discriminative cues live in the P3 texture channel are no longer drowned out by P5 semantics during repeated up-down resampling. This directly addresses the dominant failure mode of UAV imagery, in which the target occupies only tens of pixels and must survive several fusion stages with its detail preserved. IIDH contributes in two distinct ways: structurally, sharing a single CGS-derived interaction feature between the classification and regression branches reduces parameter count from 3.01 × 10⁶ to 2.24 × 10⁶, because two previously independent branch stacks are replaced by one shared feature plus two lightweight task modules; functionally, the DCNv2 offsets in the regression branch are modulated by the confidence distribution of the classification branch, so a low-confidence candidate is less able to silently drag the box regressor toward a spurious location, which is consistent with the +2.9-point mAP and +2.8-point Recall gain of Group ③ and with the reduction in duplicate boxes visible in Figure 11. EIEM contributes the largest single-module mAP gain (+5.9) because the Sobel gradient response provides a translation-equivariant edge prior that convolutional stacks learn only slowly from scratch; by injecting explicit horizontal and vertical gradient maps at the front of the backbone, EIEM gives the network a ready-made signal for the thin, high-contrast boundaries that define cracks, which is exactly the feature type that noisy aerial backgrounds tend to attenuate. 
WaveletPool is the smallest individual contributor (+0.7 mAP alone), but its value is clearest in combination: by decomposing each downsampling step into LL/LH/HL/HH sub-bands, it retains the high-frequency edge energy that strided convolutions irreversibly discard, so when FFDPN later fuses scales, the P4 and P5 features still carry the boundary information that EIEM injected at P3. The four single-module gains (3.2 + 2.9 + 0.7 + 5.9 = 12.7) are close to the full combined gain of 12.2, indicating that the modules are approximately additive rather than strongly synergistic or redundant on this dataset; their complementarity shows up more clearly in the qualitative behavior of Figure 11, where only the full combination simultaneously eliminates duplicate, missed, and misclassified detections across all four distress categories.
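The sub-band argument can be illustrated with a single-level 2D Haar transform. The NumPy sketch below is an illustration only, not the WaveletPool implementation used in this paper: the function name `haar_subbands`, the averaging normalisation, and the toy image are assumptions, and plain stride-2 subsampling is used as a simplified stand-in for a strided convolution. Sub-band naming (LH versus HL) also differs between references.

```python
import numpy as np


def haar_subbands(x):
    """Single-level 2D Haar decomposition of an array with even height and width."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 4.0  # local average (low-pass in both directions)
    lh = (a - b + c - d) / 4.0  # left minus right column: responds to vertical lines
    hl = (a + b - c - d) / 4.0  # top minus bottom row: responds to horizontal lines
    hh = (a - b - c + d) / 4.0  # diagonal detail
    return ll, lh, hl, hh


# Toy 8x8 "image": a one-pixel-wide bright vertical line on column 3 (an odd column).
img = np.zeros((8, 8))
img[:, 3] = 1.0

strided = img[0::2, 0::2]  # stride-2 subsampling keeps even rows and columns only
ll, lh, hl, hh = haar_subbands(img)

print("energy after stride-2 subsampling:", float((strided ** 2).sum()))
print("energy in LL band:", float((ll ** 2).sum()))
print("energy in detail bands:", float((lh ** 2 + hl ** 2 + hh ** 2).sum()))
```

In this toy case the thin line vanishes completely from the subsampled map, whereas the Haar bands still carry it, and the four bands together allow the original 2 × 2 blocks to be reconstructed exactly.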

4.2. Comparison with State-of-the-Art Detectors

Against the SOTA baselines in Table 4, our model’s distinguishing property is that it is simultaneously the most accurate (mAP 96.0%) and among the most compact (2.41 × 10⁶ parameters). The two-stage Faster R-CNN and the Transformer-based DETR achieve at most 78.7% mAP; their region-proposal or set-matching heads were designed for medium-to-large instances in natural images and are poorly matched to the tens-of-pixels distresses that dominate our dataset. YOLOv9 attains competitive accuracy (92.4% mAP) but at 50.7 × 10⁶ parameters and 5.8 FPS, which is roughly 21× heavier and 5× slower, making it ill-suited for on-board UAV deployment. Lightweight baselines such as YOLOv10n and YOLOv11n run faster (66.7 and 41.0 FPS) but trade between 9 and 28 points of mAP for that speed. YOLO-World inherits an open-vocabulary text encoder that is unnecessary for a closed four-class distress taxonomy and costs accuracy. Our model lies on a distinctly better point of the accuracy–efficiency frontier for this task: it gives up roughly 27 FPS relative to YOLOv8n (57.1 ⟶ 30.3) in exchange for +12.2 mAP, a trade-off governed primarily by the added cost of DCNv2 inside IIDH and the four-band convolutions of WaveletPool. At 30.3 FPS the model is sufficient for post-flight batch analysis and for real-time analysis of survey UAV video streams at typical 24–30 FPS capture rates; it is not intended for applications requiring higher frame rates such as aggressive flight-control loops, and we position it as an inspection tool rather than a control-loop perception module.
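The accuracy–efficiency frontier mentioned above can be made concrete with a simple Pareto check over two of the Table 4 columns. This is an illustrative sketch rather than part of the reported experiments: the helper names `dominates` and `pareto_front` are assumptions, only mAP@0.50 and FPS are considered, and parameter count, precision, and recall are ignored.

```python
# (mAP@0.50 in %, FPS) copied from Table 4.
TABLE4 = {
    "Faster R-CNN": (74.5, 10.32),
    "YOLOv7-tiny": (70.4, 56.9),
    "DETR": (78.7, 12.6),
    "YOLOv9": (92.4, 5.8),
    "YOLOv11n": (87.0, 41.0),
    "YOLO-World": (83.5, 41.1),
    "YOLOv10n": (67.8, 66.7),
    "YOLOv8n": (83.8, 57.1),
    "Ours": (96.0, 30.3),
}


def dominates(p, q):
    """True if p is at least as good as q on both axes and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


def pareto_front(models):
    """Names of the models that no other model dominates in (mAP, FPS)."""
    return sorted(
        name
        for name, score in models.items()
        if not any(dominates(other, score) for other in models.values())
    )


print(pareto_front(TABLE4))
```

On these two axes the non-dominated set is the proposed model, YOLOv11n, YOLOv8n, and YOLOv10n. The proposed model occupies the high-accuracy end of that frontier and dominates YOLOv9, DETR, and Faster R-CNN, which are both slower and less accurate.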

We also note that recent advances in structured and attribute-aware representation learning—for example, the size-aware graph embedding approach to remote sensing image captioning [33] and the reversible visual state-space modeling for global structure relations [34]—offer complementary directions for richer scene-level reasoning in UAV imagery. Our framework is detection-focused and does not explicitly exploit such structural priors, but integrating them into the IIDH interaction features is a promising extension.

4.3. Distinguishing Distresses from Visually Similar Non-Distress Artifacts

On highway pavement, the visual pattern most easily confused with a crack or pothole is a dark, low-reflectance blob—typically an oil leak, soil contamination, tire mark, or an asphalt patch. Two properties of the proposed pipeline help discriminate these: first, EIEM responds to steep spatial gradients, so a stain with a diffuse, low-contrast boundary produces a much weaker edge map than a crack whose boundary is a sharp discontinuity of surface albedo; second, WaveletPool’s LH, HL, and HH sub-bands collectively preserve horizontal, vertical, and diagonal high-frequency edge energy, which retains the thin oriented boundaries that define linear or branching (crack) and concave-closed (pothole) structures, while blurry, low-gradient stain silhouettes contribute little energy to these sub-bands and are therefore attenuated during downsampling. The annotation protocol described in Section 3.1 reinforces this behavior at the label level by excluding blobs without coherent crack morphology. In the visualization of Figure 11, the improved model does not produce the false-positive boxes on shadow-like regions that appear in the YOLOv7-tiny and Faster R-CNN outputs for Figure 11b. That said, a residual failure mode remains when a linear stain (e.g., a long tire-rubber streak) happens to mimic crack geometry—in that case only high-resolution inspection can resolve the ambiguity, and we report this explicitly as a limitation rather than a solved problem.
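The first property, that a Sobel operator responds more strongly to a sharp boundary than to a diffuse one of the same total contrast, can be checked on a toy example. The sketch below is a NumPy illustration under stated assumptions, not the EIEM implementation: the helper names, the 16-pixel ramp used as a "stain-like" boundary, and the unpadded filtering are all choices made here for clarity.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter3x3_valid(img, kernel):
    """Valid-mode 2D cross-correlation with a 3x3 kernel (no padding)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out


def sobel_magnitude(img):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3_valid(img, SOBEL_X)
    gy = filter3x3_valid(img, SOBEL_Y)
    return np.hypot(gx, gy)


cols = np.arange(32)
# Crack-like boundary: intensity jumps from 0 to 1 between two adjacent pixels.
sharp = np.tile((cols >= 16).astype(float), (8, 1))
# Stain-like boundary: the same 0-to-1 change spread over a 16-pixel ramp.
diffuse = np.tile(np.clip((cols - 8) / 16.0, 0.0, 1.0), (8, 1))

peak_sharp = sobel_magnitude(sharp).max()
peak_diffuse = sobel_magnitude(diffuse).max()
print(peak_sharp, peak_diffuse, peak_sharp / peak_diffuse)
```

Both toy boundaries span the same intensity range, yet the peak response to the step edge is eight times that of the ramp, which is the kind of separation the edge prior relies on. A long, sharp-edged streak would not be separated this way, in line with the residual failure mode noted above.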

4.4. Generalization, Failure Modes, and Limitations

The dataset spans UAV altitudes of 10–30 m and camera tilt angles from nadir to approximately 60° off-nadir (Section 3.1), and the improved model remained effective throughout this range in our validation and test splits, which suggests that the EIEM + WaveletPool edge-preservation chain is robust to moderate changes in instantaneous ground sampling distance. However, three failure modes are important to report candidly. (i) Low-altitude, steeply oblique views (tilt > 60°) introduce strong perspective foreshortening that distorts a transverse crack into a near-horizontal sliver; the anchor-free head inherited from YOLOv8 tends to regress an over-elongated box in this regime, and the DCNv2 offsets inside IIDH partially but not fully compensate. (ii) Strong directional illumination (for example, a low sun angle casting long lane-marker shadows) produces a Sobel response indistinguishable from a real crack at the EIEM stage, which can induce false positives; because our collection protocol avoided solar elevations below 30°, this regime is under-represented in training and therefore under-validated. (iii) Adverse conditions (rain, fog, low light, overcast) are absent from our current dataset for the same protocol reason; consequently, a quantitative comparison with YOLOv8n under such conditions cannot be performed on the present data without risk of misleading conclusions, and we treat it as open work. For the same reason, rain-streak simulation and domain-adaptation experiments were not included in this paper; we plan to collect an adverse-weather extension and benchmark the full model under the same protocol in follow-up work. We consider this scoping preferable to reporting figures on synthetically perturbed imagery whose statistics do not match real weather conditions.
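To give a rough sense of how altitude and tilt change the ground sampling distance (GSD), the sketch below uses a textbook flat-ground, pinhole-camera approximation evaluated at the image centre. It is an assumption-laden illustration, not a calibration of the camera used in this study: the function name `gsd_multipliers` is hypothetical, and lens distortion, terrain relief, and off-centre pixels are ignored.

```python
import math


def gsd_multipliers(tilt_deg):
    """GSD multipliers relative to a nadir view at the same altitude.

    Assumes flat ground, a pinhole camera, and evaluation at the image centre.
    Returns (cross_view, along_view): the ground footprint of one pixel measured
    perpendicular to and along the viewing direction.
    """
    c = math.cos(math.radians(tilt_deg))
    return 1.0 / c, 1.0 / (c * c)


for tilt in (0, 30, 45, 60):
    cross, along = gsd_multipliers(tilt)
    print(f"tilt {tilt:2d} deg: cross x{cross:.2f}, along x{along:.2f}, "
          f"anisotropy x{along / cross:.2f}")

# Nadir GSD is proportional to altitude, so the 10-30 m range alone spans a factor of 3.
altitude_ratio = 30.0 / 10.0
```

Under these assumptions, a 60° tilt stretches the along-view footprint of a pixel to about four times its nadir value and twice its cross-view footprint, so the effective resolution becomes strongly anisotropic near the upper end of the tilt range. This is one way to see why steeper views are harder for the box regressor.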

Remote-sensing satellite imagery is sometimes proposed as an alternative acquisition channel; however, optical satellites rely on visible-light imaging, are restricted to daytime operation, and are strongly impacted by clouds, rain, and fog, which limits their temporal and spatial coverage for pavement inspection. UAV inspection avoids these limits at the cost of a high-angle perspective that our modules were specifically designed to compensate for.

4.5. On Novelty and the Engineering–Theory Trade-Off

We do not claim novelty at the level of new primitive operators. DCNv2, Sobel convolution, and wavelet decomposition are each pre-existing. What is novel, and what the ablation quantifies, is the architectural claim that these particular operators, placed at specific network locations with the specific couplings described above (edge prior at the stem, aliasing-safe downsampling, series-diffusion fusion across scales, and shared-feature interaction between heads), produce a chain of mutually reinforcing effects for the dominant failure mode of UAV pavement imagery—small targets with thin edges at variable scale. The paper’s contribution is therefore positioned as task-specific architectural design rather than a new theoretical operator, and we believe this positioning is useful to the community because it converts four independently known techniques into a deployable UAV-side inspection model with a clearly documented accuracy–compute profile.

With 2.41 × 10⁶ parameters and 10.4 GFLOPs, the model is within the budget of modern UAV companion boards (e.g., edge-AI SoCs such as NVIDIA Jetson Orin Nano or Rockchip RK3588), so on-board edge-computing deployment is technically feasible. We have not yet ported and profiled the model on such hardware, which we also state explicitly as future work rather than an implied contribution.

5. Conclusions

In feature fusion, lower-level features tend to be overwhelmed by higher-level semantic features, which leads to insufficient fusion and loss of detail, and small-scale road defect features are prone to aliasing distortion during downsampling; both effects can cause missed detections. To address these issues, we took YOLOv8 as the base algorithm and improved its structure. For insufficient fusion and loss of detail, we replaced the path aggregation network of YOLOv8 with the FFDPN and replaced the dual-branch detection head with the IIDH, which also reduces the number of model parameters. We additionally added the EIEM to the feature extraction part to strengthen the extraction of edge details. For the aliasing distortion of small-scale road surface defect features, we introduced a WaveletPool module in the downsampling part, which improved the model’s detection accuracy for road surface defects. The experimental results show that the mAP of the improved model increased by 12.2 percentage points (83.8% → 96.0%), making it well suited to detecting road surface defects from a UAV perspective.

Future work will consider potential issues in road defect detection under different weather conditions and drone perspectives, annotate more road defect data, further enhance the robustness and generalization of the improved YOLOv8 algorithm, and explore how to deploy the network model into drone systems.

Author Contributions

Conceptualization, D.G. and W.C.; methodology, W.C. and H.S.; software, D.G. and H.S.; validation, H.S.; formal analysis, W.C., Z.W. and D.G.; investigation, D.G. and G.C.; resources, D.G., Z.W. and G.C.; data curation, D.G.; writing—original draft preparation, W.C.; writing—review and editing, D.G.; visualization, W.C.; project administration, D.G.; funding acquisition, D.G., Z.W. and G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Autonomous Region Key Research and Development Program Project under Grant 2022B01015, in part by the Science and Technology Program of the Ministry of Public Security under Grant 2024JSM04, and in part by the Open Project of National Engineering Research Center for Road Traffic Safety Management and Control Technology under Grant 2026DYZX003.

Data Availability Statement

All code, data, and materials used in this research are available upon request from the corresponding author.

Acknowledgments

The authors thank their colleagues working with them at Xinjiang University. The authors would also like to thank the anonymous reviewers of this article for their constructive comments and suggestions.

Conflicts of Interest

Authors Zhenxun Wei and Guoliang Chen were employed by the company Xinjiang Institute of Transportation Science and Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
FPN	Feature Pyramid Network
WASPP	Waterfall Atrous Spatial Pyramid Pooling
SE	Squeeze and Excitation
IoU	Intersection over Union
MIoU	Mean Intersection over Union
Params	Parameters

References

1. Kano, E.; Tachibana, S.; Tsuda, K. Analyzing the Impact of Digital Technologies on the Productivity of Road Maintenance Operations. Procedia Comput. Sci. 2022, 207, 1623–1632.
2. Renzi, E.; Trifarò, C.A. Knowledge and Digitalization: A Way to Improve Safety of Road and Highway Infrastructures. Procedia Struct. Integr. 2023, 44, 1228–1235.
3. Zhang, C.; Nateghinia, E.; Miranda-Moreno, L.F.; Sun, L. Pavement Distress Detection Using Convolutional Neural Network (CNN): A Case Study in Montreal, Canada. Int. J. Transp. Sci. Technol. 2022, 11, 298–309.
4. Wang, J.; Zhou, K.; Xing, W.; Li, H.; Yang, Z. Applications, Evolutions, and Challenges of Drones in Maritime Transport. J. Mar. Sci. Eng. 2023, 11, 2056.
5. Yan, J. Research on the Application of UAV Remote Sensing Technology in Surveying and Mapping Engineering Survey. In Springer Proceedings in Physics; Springer Nature: Singapore, 2022; pp. 385–394.
6. Wang, C.; Pei, H.; Tang, G.; Liu, B.; Liu, Z. Pointer Meter Recognition in UAV Inspection of Overhead Transmission Lines. Energy Rep. 2022, 8, 243–250.
7. Li, L.; Hu, Z.; Liu, Q.; Yi, T.; Han, P.; Zhang, R.; Pan, L. Effect of Flight Velocity on Droplet Deposition and Drift of Combined Pesticides Sprayed Using an Unmanned Aerial Vehicle Sprayer in a Peach Orchard. Front. Plant Sci. 2022, 13, 981494.
8. Zhu, J.; Zhong, J.; Ma, T.; Huang, X.; Zhang, W.; Zhou, Y. Pavement Distress Detection Using Convolutional Neural Networks with Images Captured via UAV. Autom. Constr. 2022, 133, 103991.
9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
12. Zhang, Y.; Zuo, Z.; Xu, X.; Wu, J.; Zhu, J.; Zhang, H.; Wang, J.; Tian, Y. Road Damage Detection Using UAV Images Based on Multi-Level Attention Mechanism. Autom. Constr. 2022, 144, 104613.
13. Wang, T.; Cui, Z.; Li, X. AMFT-YOLO: A Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes. In Lecture Notes in Computer Science; Springer Nature: Singapore, 2025; pp. 72–85.
14. Yan, X.; Sun, S.; Zhu, H.; Hu, Q.; Ying, W.; Li, Y. DMF-YOLO: Dynamic Multi-Scale Feature Fusion Network-Driven Small Target Detection in UAV Aerial Images. Remote Sens. 2025, 17, 2385.
15. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534.
16. Ning, J.; Spratling, M. The Importance of Anti-Aliasing in Tiny Object Detection. arXiv 2023, arXiv:2310.14221.
17. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A Global-Local Self-Adaptive Network for Drone-View Object Detection. IEEE Trans. Image Process. 2021, 30, 1556–1569.
18. Cai, D.; Lu, Z.; Fan, X.; Ding, W.; Li, B. Improved YOLOv4-Tiny Target Detection Method Based on Adaptive Self-Order Piecewise Enhancement and Multiscale Feature Optimization. Appl. Sci. 2023, 13, 8177.
19. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021.
20. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. arXiv 2021, arXiv:2103.09136.
21. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331.
22. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO[EB/OL], Version 8.0.0.; AGPL-3.0 License; Ultralytics YOLO: London, UK, 2023.
23. Ferrà, A.; Aguilar, E.; Radeva, P. Multiple Wavelet Pooling for CNNs. In Computer Vision–ECCV 2018 Workshops; Leal-Taixé, L., Roth, S., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; Volume 11132, pp. 671–675.
24. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
25. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168.
26. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv 2024, arXiv:2401.17270.
27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
28. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
29. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
30. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
31. Jocher, G.; Qiu, J. Ultralytics YOLO11n[EB/OL], Version 11.0.0.; AGPL-3.0 License; Ultralytics YOLO: London, UK, 2024.
32. Yan, H.; Zhang, J. UAV-PDD2023: A Benchmark Dataset for Pavement Distress Detection Based on UAV Images. Data Brief 2023, 51, 109692.
33. Ni, Z.; Xu, Y.; Zhang, W.; Zong, Z.; Ren, P. A Size-Aware Graph Embedding Approach to Remote Sensing Image Captioning with Object Relative Size Information. IEEE Trans. Geosci. Remote Sens. 2026, 64, 1–13.
34. Ma, P.; Fu, Y.; Lyu, J.; Liu, Z. Understanding Global Structure Relation via Reversible Visual State Space Model for Robust Cross-View Geo-Localization. In Proceedings of the 3rd International Workshop on UAVs in Multimedia: Capturing the World from a New Perspective (UAVM ’25), Dublin, Ireland, 27–31 October 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 42–46.

Figure 1. The YOLOv8 model structure. Arrows indicate the data flow direction between modules.

Figure 2. The improved YOLOv8 model structure.

Figure 3. The Feature-Focusing Diffusion Pyramid Network structure.

Figure 4. FocusFeature module structure. Arrows indicate feature flow; rectangles represent computation modules.

Figure 5. The Information Interaction Detection Head structure. Arrows indicate feature flow; ⊕ denotes element-wise addition.

Figure 6. The task module structure. Arrows indicate the data flow; layer abbreviations are defined in Section 2.2.2.

Figure 7. The Edge Information Extraction Module.

Figure 8. The WaveletPool structure. Arrows indicate the decomposition flow.

Figure 9. Dataset example.

Figure 10. Comparison of loss curves in mainstream object detection models.

Figure 11. Comparison of detection performance. (a) Alligator crack detection; (b) Pothole and transverse crack detection; (c) Longitudinal crack detection.

Table 1. Experimental environment.

Name	Parameters
Operating System	Windows 10
Processor	AMD Ryzen 9 5950X
Graphics Card	NVIDIA GeForce RTX 3090 Ti
RAM	32 GB
Development Language	Python 3.9
Development Environment	PyCharm 2021
Deep Learning Framework	PyTorch 1.10
CUDA Toolkit	CUDA 11.3

Table 2. Training parameter configuration.

Hyperparameter	Value
Input image size	1920 × 1080 × 3
Optimizer	SGD
Initial learning rate	0.01
Weight decay coefficient	5 × 10⁻⁴
Momentum parameter	0.937
Batch size	4
Number of training epochs	300

Table 3. Ablation experiment results.

Group	FFDPN	IIDH	WaveletPool	EIEM	P/%	R/%	mAP@0.50/%	Params/10⁶	GFLOPs	FPS/(Frame·s⁻¹)
①	−	−	−	−	79.3	76.9	83.8	3.01	8.1	57.1
②	√	−	−	−	85.1	75.5	87.0	3.04	9.4	38.3
③	−	√	−	−	87.6	79.7	86.7	2.24	8.6	59.2
④	−	−	√	−	89.9	74.9	84.5	2.70	7.5	53.7
⑤	−	−	−	√	93.2	79.9	89.7	3.02	8.8	52.9
⑥	√	√	−	−	86.0	79.0	88.3	2.61	10.0	38.1
⑦	√	√	√	−	88.1	83.7	90.5	2.31	9.4	38.3
⑧	√	√	√	√	93.7	89.6	96.0	2.41	10.4	30.3

① denotes the original model; ②–⑧ denote models in which modules of the original model have been modified. A “√” marks a module that is applied and a “−” marks one that is not; multiple “√” marks indicate that several modifications are combined.

Table 4. Comparison experiment results.

Model	P/%	R/%	mAP@0.50/%	Params/10⁶	GFLOPs	FPS/(Frame·s⁻¹)
Faster R-CNN	82.7	44.1	74.5	28.48	15,411.79	10.32
YOLOv7-tiny	82.9	55.2	70.4	10.80	8.2	56.9
DETR	80.5	70.9	78.7	19.89	57.0	12.6
YOLOv9	83.9	90.8	92.4	50.70	236.7	5.8
YOLOv11n	80.5	79.3	87.0	2.58	6.3	41.0
YOLO-World	82.6	76.2	83.5	4.05	9.6	41.1
YOLOv10n	56.7	73.0	67.8	2.27	6.5	66.7
YOLOv8n	79.3	76.9	83.8	3.01	8.1	57.1
Ours	93.7	89.6	96.0	2.41	10.4	30.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
