Article

UAV-YOLO12: A Multi-Scale Road Segmentation Model for UAV Remote Sensing Imagery

1 Department of Civil and Environmental Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
2 Department of Civil and Environmental Engineering, Penn State, University Park, State College, PA 16802, USA
* Author to whom correspondence should be addressed.
Drones 2025, 9(8), 533; https://doi.org/10.3390/drones9080533
Submission received: 27 June 2025 / Revised: 25 July 2025 / Accepted: 28 July 2025 / Published: 29 July 2025


Highlights

What are the main findings?
  • A UAV-oriented segmentation network (UAV-YOLOv12) is proposed, integrating SKNet and PConv to enhance spatial adaptability and feature selectivity.
  • The model achieves high accuracy and generalization across four public aerial datasets, maintaining robust performance under occlusion and scale variation.
What is the implication of the main finding?
  • The proposed method provides a practical and efficient solution for UAV-based road monitoring in real-world environments.
  • It enables accurate extraction of small and occluded road structures, supporting real-time infrastructure assessment and smart city applications.

Abstract

Unmanned aerial vehicles (UAVs) are increasingly used for road infrastructure inspection and monitoring. However, challenges such as scale variation, complex background interference, and the scarcity of annotated UAV datasets limit the performance of traditional segmentation models. To address these challenges, this study proposes UAV-YOLOv12, a multi-scale segmentation model specifically designed for UAV-based road imagery analysis. The proposed model builds on the YOLOv12 architecture by adding two key modules: a Selective Kernel Network (SKNet) that adjusts receptive fields dynamically, and a Partial Convolution (PConv) module that improves spatial focus and robustness in occluded regions. These enhancements help the model better detect small and irregular road features in complex aerial scenes. Experimental results on a custom UAV dataset collected from national highways in Wuxi, China, show that UAV-YOLOv12 achieves F1-scores of 0.902 for highways (road-H) and 0.825 for paths (road-P), outperforming the original YOLOv12 by 5.0 and 3.2 percentage points, respectively. Inference speed is maintained at 11.1 ms per image, supporting near real-time performance. Moreover, comparative evaluations show that UAV-YOLOv12 improves the F1-score over U-Net by 7.1 and 9.5 percentage points. The model also exhibits strong generalization ability, achieving F1-scores above 0.87 on public datasets such as NWPU VHR-10 and the Drone Vehicle dataset. These results demonstrate that the proposed UAV-YOLOv12 achieves high accuracy and robustness across diverse road environments and object scales.

1. Introduction

As a critical component of modern urban systems, road infrastructure plays an increasingly important role in maintaining traffic order, promoting economic development, and enhancing operational efficiency [1]. The functionality of urban road networks affects both travel safety and the circulation of social resources. In particular, with the evolution of intelligent transportation systems and smart cities, there is a growing need for automated and intelligent surveillance of infrastructure [2,3]. Traditional approaches to evaluating road conditions usually involve manual inspections or data collection with instrumented vehicles [4]. These methods tend to be costly, inefficient, and severely restricted in their ability to cover large areas within a limited timeframe [5].
The last few years have seen rapid advancements in unmanned aerial vehicles (UAVs), resulting in their increased use for monitoring roads in urban environments [6,7]. Compared to traditional platforms, UAVs offer low cost, high mobility, and wide-area coverage. By integrating high-resolution remote sensing systems, UAVs can streamline road surveillance [8]. Equipped with precision imaging systems, UAVs can gather detailed data about roads in complex urban settings, supporting critical tasks such as assessing road conditions, evaluating infrastructure performance, scheduling maintenance, and planning emergency responses [9]. However, accurately and efficiently delineating road areas from large volumes of UAV images remains a significant hurdle in both academic research and practical applications [10].
Remote sensing images are often interpreted manually, but this procedure is inefficient, subjective, and poorly suited to tracking changes over time [11]. For example, region growing [12] and its improved algorithms [13] have been used to extract road networks, demonstrating the potential of geometric and texture constraints, yet their adaptability remained limited. Automated road extraction is further complicated by perspective effects that project three-dimensional objects onto a two-dimensional image plane, by shadows cast by structures such as trees and vehicles, and by intermittent occlusion that obstructs visibility [14,15]. As a result, the use of artificial intelligence, particularly deep learning, to enhance the accuracy and efficiency of road detection in UAV imagery has become a key area of research [16,17].
Convolutional neural networks (CNNs) and object detection algorithms have achieved remarkable breakthroughs in computer vision and have shown promising applications in the automatic analysis of remote sensing imagery [18,19,20,21]. Recently, several studies have explored advanced deep learning approaches for road detection in UAV imagery, aiming to address the challenges of complex backgrounds, occlusions, and variable road appearances. For example, Zhao et al. proposed a context-aware segmentation network to automatically extract building and road information in urban functional areas from high-resolution remote sensing images, achieving accurate division of spatial units and a classification accuracy of 94% on the GaoFen-2 dataset [22]. Sun et al. proposed a building and road segmentation network for remote sensing images that integrates multi-scale branches with a Transformer feature extraction mechanism, effectively improving the global receptive field and spatial information retention, and outperforming traditional deep learning methods on multiple indicators [23].
Furthermore, many studies have adopted semantic segmentation methods to extract road outlines from UAV imagery [24,25]. Classical architectures such as U-Net [26], DeepLabV3 [27], and DeepLabV3+ [28] are widely used in remote sensing tasks and are particularly effective in pixel-level road segmentation. However, these models typically involve complex structures, large parameter sizes, and relatively slow inference speeds, making them less suitable for real-time applications on UAV platforms. In contrast, the YOLO series of segmentation algorithms, known for their end-to-end training capability, fast inference, and competitive accuracy, has increasingly been applied to road detection tasks in remote sensing images [29,30]. Zhao et al. [31] developed YOLO-U, a lightweight multi-task network for real-time vehicle detection and road segmentation in aerial drone images, enhanced by Ghost-Dilated convolution and Ghost-ASPP modules. This model noticeably improves the accuracy of detecting narrow pathways and small objects compared to previous models.
Although these methods achieve some improvements, many issues in road extraction from UAV imagery remain unresolved [32,33]. First, most models struggle to extract small or multi-scale secondary and non-standard roads. Second, accuracy suffers from background factors such as shadows and occlusions, as well as blur introduced during capture, which increase false and missed detections due to noise. Third, many models suffer from weak generalization capabilities, limiting their performance across different geographic regions or environmental conditions. Finally, there remains a trade-off between model accuracy and computational complexity, which constrains deployment on UAV platforms with limited resources.
To address these challenges, this study proposes an improved YOLOv12-based segmentation model, referred to as UAV-YOLOv12, specifically designed for automatic road boundary extraction from UAV-captured remote sensing imagery. The model incorporates several architectural enhancements based on the original YOLOv12 framework: (1) the integration of a Selective Kernel Network (SKNet) into the backbone to enable dynamic receptive field adaptation for better feature representation across varying scales; (2) the replacement of standard convolution with Partial Convolution (PConv) in key modules, focusing computation on informative regions and improving performance under occlusion or blur. The UAV-YOLOv12 model is designed to strike a balance between detection accuracy, computational efficiency, and generalization, offering a practical solution for UAV-based road monitoring, infrastructure assessment, and urban management tasks.

2. Methodology

Figure 1 illustrates the overall technical pipeline of the proposed UAV-YOLOv12 model, which is developed for automatic detection and segmentation of road targets from UAV-acquired imagery. UAV-YOLOv12 performs pixel-level semantic segmentation to generate class-specific masks, enabling accurate delineation of road areas beyond coarse bounding boxes. SKNet and PConv have been incorporated into the model to mitigate problems like differing scales of features, interference from complex backgrounds, and noise in aerial road scenes. The model undergoes a series of evaluations to check its detection accuracy and efficiency. Furthermore, these evaluations also test UAV-YOLOv12’s generalization performance on several open-source datasets, which not only confirms its robustness but also its transferability across different aerial environments.

2.1. Data Collection

In order to evaluate the practical effectiveness of the UAV-YOLOv12 model, a comprehensive dataset was created specifically for UAV-based road infrastructure monitoring. The dataset covers diverse categories of road environments and traffic objects, varying in resolution, viewing angle, object scale, background complexity, and other factors that contribute to visual clutter. All data were captured by UAVs and labeled manually by our researchers to ensure high quality, without incorporating any open-source datasets.

2.1.1. UAV Data Acquisition

A DJI Mavic 2 Pro UAV equipped with a Hasselblad L1D-20c camera was used to capture images. This camera features a 1-inch CMOS sensor capable of capturing images at a resolution of 5472 × 3648 pixels [34]. The UAV was flown at altitudes ranging from 60 to 120 m, following flight paths pre-programmed in DJI Ground Station Pro software (version 2.0) to enable fully automated data collection.
To ensure full scene coverage and geometric continuity, the front and side overlap rates were set at 80% and 70%, respectively. The data were collected over a national highway test section in Wuxi, Jiangsu Province, China. The flight areas include various road types such as urban arterial roads, expressways, rural roads, and elevated highways, representing a wide range of real-world road monitoring scenarios.
The ground sampling distance (GSD) of the collected images was approximately 0.1 m, and the average flight altitude was 100 m. The UAV maintained a flight speed of approximately 1.0 m/s, with a shooting interval of 346 s per image and a total of 9 stitched frames per survey line, ensuring comprehensive coverage and high-resolution visual data suitable for real-world road segmentation tasks.

2.1.2. Data Preprocessing

All raw images were first geometrically corrected and resampled to ensure consistency in spatial resolution and scale. Specifically, all images were rescaled to a uniform resolution of 0.1 m/pixel. Frames with severe shadowing, significant blur, or obvious redundancy were manually discarded to ensure sample integrity and annotation accuracy.
To improve the robustness and generalization of the model under varying conditions, we implemented a series of data augmentation strategies. These augmentations were conducted as a preprocessing step prior to network training, not as part of the architecture.
The specific augmentation pipeline includes the following transformations, applied in random order (a code sketch is provided after the list):
(1)
Random cropping: Cropping to 512 × 512 patches from the original image with at least 80% area overlap with labeled masks.
(2)
Horizontal and vertical flipping: Each applied with a 50% probability.
(3)
Histogram equalization: Used to enhance contrast under low-light conditions.
(4)
Color jittering: Brightness, contrast, saturation, and hue were randomly adjusted within the range of ±15%.
(5)
Random rotation: Applied within ±10° range to simulate UAV yaw variability.
(6)
Gaussian blur: σ randomly selected between 0.1 and 1.0, applied with 30% probability to simulate motion blur.
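To make the pipeline concrete, the sketch below assembles the transformations listed above using the albumentations library. This tooling choice is an assumption (the paper does not name its implementation); the transform order here is fixed rather than random, the equalization probability is assumed, and the ≥80% mask-overlap constraint for cropping would require an additional custom check on the returned mask.

```python
# A possible implementation of the preprocessing augmentations, sketched with
# albumentations (assumed tooling; not the authors' released pipeline).
import albumentations as A

augment = A.Compose([
    A.RandomCrop(height=512, width=512, p=1.0),             # (1) random 512x512 patches
    A.HorizontalFlip(p=0.5),                                 # (2) flips, 50% probability each
    A.VerticalFlip(p=0.5),
    A.Equalize(p=0.3),                                       # (3) histogram equalization (probability assumed)
    A.ColorJitter(brightness=0.15, contrast=0.15,
                  saturation=0.15, hue=0.15, p=0.5),         # (4) +/-15% color jitter
    A.Rotate(limit=10, p=0.5),                               # (5) +/-10 degree rotation (UAV yaw)
    A.GaussianBlur(blur_limit=(3, 7),
                   sigma_limit=(0.1, 1.0), p=0.3),           # (6) blur, sigma in [0.1, 1.0], 30% probability
])

# Image and segmentation mask are transformed jointly so labels stay aligned:
# out = augment(image=image, mask=mask)
# aug_img, aug_mask = out["image"], out["mask"]
```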

2.1.3. Dataset Information

All images were manually labeled with the assistance of semi-automated tools by experienced annotators with backgrounds in remote sensing and transportation engineering. Key annotated categories include road boundaries, lane markings, manholes, cracks, vehicles, and traffic signs.
The final dataset consists of 2000 images, of which 1730 images are annotated, and 270 are unlabeled, with a total of more than 3300 labels. To ensure diversity in object scales, approximately 35% of the annotated targets are small (area < 32 × 32 pixels), 48% are medium-sized, and 17% are large objects.
For the training process, the dataset was split into three subsets, as listed in Table 1. In the table, road-H refers to labeled objects captured from highways and elevated roads, while road-P represents labels from paths or secondary roads, including urban streets, rural roads, and branch lanes. This distinction enables detailed analysis across different types of road infrastructure and helps evaluate the model’s adaptability to varied road environments.
All data in this study were collected and constructed independently using UAV systems deployed in the Wuxi national highway monitoring area. The dataset is original, comprehensive, and fully representative of UAV-based road scenarios. It provides a solid foundation for multi-scale object detection experiments and serves as a reliable benchmark for autonomous road feature recognition in UAV remote sensing applications.

2.2. The Original YOLOv12

YOLOv12 represents the twelfth generation in the YOLO (You Only Look Once) object detection family [35]. It is the first version to integrate attention mechanisms as a core architectural element, aiming to enhance the detection of small and occluded objects in complex environments while maintaining real-time inference speed. To ensure real-time performance and low computational overhead for UAV deployment scenarios, we adopt the YOLOv12-N model as the base model for our implementation. The overall architecture of YOLOv12-N includes three main components: the Backbone, Neck, and Segmentation Head, as presented in Figure 2.
  • Backbone: The backbone adopts an improved version of the Efficient Layer Aggregation Network (ELAN) architecture, known as R-ELAN. This design introduces block-level residual connections and feature scaling to address issues like gradient vanishing and training instability in deep networks. By improving feature reuse and aggregation across layers, R-ELAN can significantly strengthen the ability to represent multi-scale objects.
  • Neck: The neck incorporates an Area Attention mechanism, which divides feature maps into horizontal and vertical regions to reduce the computational complexity of attention while preserving a large receptive field. In addition, YOLOv12 integrates FlashAttention, a high-efficiency memory access optimization for attention operations that accelerates inference speed. For multi-scale feature aggregation, the neck employs the A2C2f module, which facilitates effective fusion of semantic features at different scales, thereby improving detection performance for targets of varying sizes.
  • Head: The head includes multi-scale detection and segmentation. The YOLOv12 head comprises three parallel branches operating on feature maps P3, P4 and P5; each branch uses cascaded 1 × 1 and 3 × 3 convolutions (C3k2/C3k modules) and prediction layers to localize and classify small, medium, and large objects, respectively. Immediately following each detection branch, a lightweight segmentation subhead fuses multi-scale features via A2C2f blocks, upsampling and concatenations to produce a per-pixel mask at the corresponding resolution, sharpening object boundaries. By jointly learning detection and segmentation in an end-to-end architecture, this dual-task head delivers high precision and robust boundary accuracy across diverse object scales without compromising real-time inference speed.

2.3. UAV-YOLOv12

To improve the accuracy and robustness of YOLOv12 in detecting multi-scale road infrastructure targets such as lane markings, manholes, and surface cracks in UAV imagery, two critical architectural modifications were introduced. The structure of the improved model is illustrated in Figure 3.

2.3.1. PConv

In the original YOLOv12 architecture, standard convolution (Conv) operations are extensively used in modules such as feature extraction, downsampling, and multi-scale fusion. Although conventional convolution is effective in encoding visual features, its globally uniform processing tends to produce redundant responses in irrelevant regions. This limitation is especially pronounced in UAV road scenes, where occlusions, shadows, and motion blur are frequent, causing standard convolutional filters to encode ambiguous or noisy features without discrimination and making it difficult to selectively emphasize meaningful regions.
To address this issue, this study incorporates Partial Convolution (PConv) in place of standard convolution. Originally proposed for image inpainting tasks, PConv introduces a mask-guided mechanism that activates convolution operations only over valid pixels in the input feature map. The structure of the PConv module is shown in Figure 4. The core idea of PConv is to apply convolution only to valid regions of the input, while maintaining an identity mapping over invalid or irrelevant areas such as occluded zones, background clutter, or missing pixels. This design improves both the specificity and sparsity of feature representations.
The mechanism of PConv enables the model to distinguish between informative and irrelevant spatial regions at the convolution level, improving robustness and feature compactness. In practice, PConv introduces a dynamic masking mechanism that identifies valid input regions within each convolution window. Based on this binary mask, the input features and convolution kernels are selectively activated and normalized to focus computation only on informative regions. Compared to traditional convolution, PConv provides several advantages in sparse feature modeling:
  • It focuses only on regions that contain significant semantic content and automatically ignores low-information areas such as repetitive textures, vegetation, and lighting artifacts commonly found in UAV road scenes.
  • Since PConv operates only on activated subregions, it eliminates unnecessary computations, leading to lower memory usage and fewer parameters without sacrificing performance.
  • The method is particularly effective in detecting object boundaries and structural defects in road targets, making it suitable for fine-grained tasks such as crack localization and edge delineation.
  • PConv demonstrates more stable performance under challenging conditions, including occlusion, missing data, and heavy background noise, which are prevalent in UAV-based remote sensing imagery.
  • By explicitly modeling valid spatial support, PConv enhances the interpretability of feature maps, enabling better visual analysis of the model’s attention regions during UAV-based road detection.
In this work, PConv is integrated into both the Backbone and Neck modules of the YOLOv12 architecture. This integration significantly reduces the overall computational burden and model complexity, while preserving detection accuracy. Experimental results show that the modified model with PConv achieves better target localization accuracy with only a modest impact on inference time, particularly in detecting small-scale objects and targets in low-contrast regions. It also offers enhanced resilience in ambiguous boundary regions where traditional convolution tends to produce blurred or diluted responses.
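As a reference for how such mask-guided filtering can be realized, the sketch below implements a partial convolution layer in PyTorch, following the re-normalization scheme of the image-inpainting formulation cited above. It is a minimal illustration under assumed layer hyperparameters, not the exact module used in UAV-YOLOv12.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Mask-guided convolution: responses are computed only over valid pixels
    and re-normalized by the fraction of valid inputs in each window."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("weight_mask",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.window_size = kernel_size * kernel_size
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: (N, 1, H, W) binary map, 1 = valid pixel, 0 = invalid/occluded.
        with torch.no_grad():
            valid_count = F.conv2d(mask, self.weight_mask,
                                   stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                               # convolve valid content only
        scale = self.window_size / valid_count.clamp(min=1.0)   # re-normalize by valid fraction
        out = out * scale * (valid_count > 0).float()           # zero fully-invalid windows
        new_mask = (valid_count > 0).float()                    # propagate updated validity mask
        return out, new_mask

# usage: feats, mask = PartialConv2d(64, 128)(features, validity_mask)
```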

2.3.2. SKNet

The SKNet module is a dynamic attention mechanism introduced to address the inherent limitation of fixed convolutional kernels in adapting to scale-variant and spatially inconsistent road features commonly observed in UAV imagery. To improve the capability of YOLOv12 in detecting objects of varying scales, particularly small and fine-grained targets in UAV remote sensing images, an SKNet module was added at the end of the backbone. This module dynamically adjusts the receptive field of convolutional neurons, enabling the network to adaptively select suitable kernel sizes according to the scale and position of objects in the image. As a result, the network becomes more responsive to key regions and captures richer spatial context. This is particularly beneficial for UAV imagery, where objects often exhibit significant scale variations and are surrounded by complex backgrounds.
The core idea of SKNet is to simulate scale-aware feature selection by allowing the model to choose between convolutional responses obtained from different kernel sizes. As illustrated in Figure 5, the SKNet module operates through three main stages:
  • Split: The input feature map Z is simultaneously processed by two convolutional branches with different kernel sizes. One branch uses a standard 3 × 3 convolution, and the other applies a dilated 5 × 5 convolution with a dilation rate of 2. These two branches are designed to extract both local and broader contextual features, capturing multi-scale semantic information.
  • Fuse: The outputs from the two branches are added elementwise to generate an intermediate feature map U, which is then globally average pooled to generate a channel descriptor S. This descriptor passes through two fully connected layers followed by a SoftMax activation to produce two channel-wise attention weights a and b, which represent the relative significance of each branch. The attention weights a and b enable the model to dynamically prioritize convolution outputs that best match the local structural context. This adaptive mechanism allows SKNet to enhance features relevant to road boundaries, particularly in cases with weak or missing edge cues.
  • Select: The final output V is obtained by applying the weight a and b to the corresponding branch outputs and summing the weighted features. This process allows the network to emphasize the most informative receptive field for each spatial location based on the content of the input image.
With SKNet, the model can adjust its perception dynamically according to the complexity of the scene and the size of the target. This is especially beneficial for UAV images where small and narrow structures such as cracks, edge lines, and manhole covers are common. The use of SKNet effectively compensates for the limitations of fixed receptive fields and enhances the model’s responsiveness to small-scale objects.
Additionally, the SKNet module is integrated with the A2C2f-based feature fusion path to construct a multi-scale feature stream with selective attention. This integration improves the continuity of spatial details during upsampling and downsampling and further strengthens the detection performance of small and medium-sized objects, while maintaining the model’s efficiency.
These capabilities make SKNet not just beneficial but essential for high-fidelity segmentation in UAV road imagery, where road structures exhibit diverse widths, curvatures, and edge clarity. Without such adaptive modeling, models relying solely on static kernels often suffer from performance degradation in underrepresented or occluded road areas.
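The split–fuse–select procedure described above can be summarized by the PyTorch sketch below. It is a minimal two-branch selective-kernel block written for illustration: the larger branch is realized as a 3 × 3 convolution with dilation 2 (a common way to obtain an effective 5 × 5 receptive field), and the channel-reduction ratio is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SelectiveKernel(nn.Module):
    """Two-branch selective-kernel block: Split -> Fuse -> Select."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Split: a standard 3x3 branch and a dilated branch with a wider receptive field.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        hidden = max(channels // reduction, 16)
        # Fuse: squeeze the summed branches into a compact channel descriptor S.
        self.fc_reduce = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        # Select: one score vector per branch; softmax across branches gives weights a, b.
        self.fc_select = nn.Linear(hidden, channels * 2)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        u = u3 + u5                                      # Fuse: elementwise addition -> U
        s = u.mean(dim=(2, 3))                           # global average pooling -> (N, C)
        z = self.fc_reduce(s)
        scores = self.fc_select(z).view(-1, 2, u.size(1))
        a, b = torch.softmax(scores, dim=1).unbind(1)    # channel-wise weights per branch
        a = a.unsqueeze(-1).unsqueeze(-1)
        b = b.unsqueeze(-1).unsqueeze(-1)
        return a * u3 + b * u5                           # Select: weighted sum V
```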

2.3.3. Loss Function

YOLOv12 adopts a multi-task loss function to jointly optimize object localization, classification, and confidence prediction [36]. Based on this framework, the following improvements are introduced in this study.
(1)
The original YOLOv12 uses CIoU or DIoU loss to measure the discrepancy between predicted and ground-truth bounding boxes. In this work, we incorporate the more advanced SIoU (SCYLLA-IoU) loss. This method not only considers center distance, overlap area, and aspect ratio but also integrates angle and boundary direction information. This is particularly beneficial for scenarios involving slanted or curved structures, such as bent roads and oblique lane markings, significantly enhancing localization accuracy.
(2)
To address the class imbalance problem commonly found in UAV images, where small targets are underrepresented and background classes dominate, we adopt the Focal Loss mechanism. This adjusts the loss contribution of well-classified examples, reducing the impact of easy negatives and helping the model focus more on difficult and underrepresented classes.
(3)
To mitigate the imbalance between positive and negative samples during training, we introduce label smoothing to soften hard classification boundaries.
The total loss function is defined as:
$L_{total} = \lambda_{cls} \cdot L_{cls} + \lambda_{obj} \cdot L_{obj} + \lambda_{box} \cdot L_{SIoU}$ (1)
where λcls, λobj, and λbox are the balancing weights for classification loss, objectness confidence loss, and bounding box regression loss, respectively. In this study, they are empirically set to a ratio of 1:5:2.
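The sketch below shows one way to combine these terms with the stated 1:5:2 weighting, using a focal classification loss with label smoothing. It is illustrative only: the SIoU regression term is assumed to be computed elsewhere, and the focal-loss hyperparameters (alpha, gamma) and smoothing factor are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss: down-weights easy examples so rare, hard targets dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                      # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()

def total_loss(cls_logits, cls_targets, obj_logits, obj_targets, siou_loss,
               weights=(1.0, 5.0, 2.0), smoothing=0.1):
    """Weighted multi-task loss with the 1:5:2 ratio; siou_loss is a precomputed box term."""
    lam_cls, lam_obj, lam_box = weights
    # Label smoothing softens hard 0/1 targets before the focal classification term.
    cls_targets = cls_targets * (1.0 - smoothing) + 0.5 * smoothing
    l_cls = focal_loss(cls_logits, cls_targets)
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
    return lam_cls * l_cls + lam_obj * l_obj + lam_box * siou_loss
```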

2.4. Evaluation Metrics

To comprehensively assess the performance of the proposed UAV-YOLOv12 model in UAV-based object detection tasks, this study adopts several commonly used evaluation metrics for image segmentation and object detection. These include Precision, Recall, F1-score, and Intersection over Union (IoU).
(1) Precision
Precision (P) represents the ratio of true positive results to the total number of samples classified as positive. It indicates how accurate the positive predictions are and is expressed by Equation (2) [37]:
$\text{Precision} = \frac{TP}{TP + FP}$ (2)
where TP refers to the number of true positives, and FP indicates the number of false positives.
(2) Recall Rate
Recall rate (R) evaluates the ability of the model to detect all actual positive cases. It is the ratio of true positives to the total number of actual positive instances, and is defined by Equation (3):
$\text{Recall rate} = \frac{TP}{TP + FN}$ (3)
where FN represents the number of false negatives.
(3) F1 score
F1 score is the harmonic mean of Precision and Recall, and it provides a single metric that balances both correctness and completeness. It is defined as Equation (4) [38]:
$F1\ \text{score} = \frac{2 \times P \times R}{P + R}$ (4)
A higher F1-score reflects improved model performance by balancing both prediction accuracy and the ability to detect relevant targets.
(4) IoU
IoU is a fundamental metric in image segmentation and object detection. It measures the degree of overlap between the predicted region and the ground truth, defined as Equation (5) [34]:
$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{TP}{TP + FP + FN}$ (5)
Higher IoU values indicate that the predicted bounding boxes are closer to the ground-truth annotations, suggesting better localization performance. In our experiments, we report the average IoU (Mean IoU, mIoU) as an overall performance indicator.
Note that in this study, the IoU is computed at the pixel level, consistent with the pixel-wise semantic segmentation setting. Here, TP, FP, and FN, respectively, represent the number of pixels correctly predicted as road (true positive), non-road pixels incorrectly classified as road (false positive), and road pixels not detected by the model (false negative). This ensures that the overlap measurement reflects segmentation performance rather than object-level bounding box matching. In our implementation, we utilize the segmentation version of YOLOv12, where each pixel is classified into a semantic category with an associated confidence score. A fixed threshold of 0.5 is applied to the confidence map to generate the final binary segmentation mask for evaluation, following standard practices.
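For reference, the sketch below computes the four pixel-level metrics from a binary prediction mask (obtained with the 0.5 confidence threshold mentioned above) and a ground-truth mask; the small epsilon terms are added only to avoid division by zero and are not part of the definitions.

```python
import numpy as np

def segmentation_metrics(pred_mask, gt_mask, eps=1e-9):
    """Pixel-level Precision, Recall, F1, and IoU for binary road masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # road pixels correctly predicted as road
    fp = np.logical_and(pred, ~gt).sum()       # non-road pixels predicted as road
    fn = np.logical_and(~pred, gt).sum()       # road pixels missed by the model
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou

# usage with the 0.5 confidence threshold described in the text:
# p, r, f1, iou = segmentation_metrics(conf_map > 0.5, gt_mask > 0)
```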

3. Results and Discussions

All experiments were run on a workstation with Intel i9-13900K CPU (Intel Corporation, Santa Clara, CA, USA), NVIDIA RTX 3090 (24 GB) (NVIDIA Corporation, Santa Clara, CA, USA), 128 GB RAM (Kingston Technology, Fountain Valley, CA, USA), using Ubuntu 20.04, CUDA 11.8, and PyTorch 2.0. The training and evaluation processes were implemented using a customized version of the YOLOv12 framework.
All YOLOv12 variants were trained with an initial learning rate of 0.001, applying cosine decay. The SGD optimizer was used with 0.937 momentum and 0.0005 weight decay. Training ran for 300 epochs with a batch size of 32. Anchor-free detection heads were employed with multi-scale training enabled. All models were trained from scratch using only the proposed UAV dataset, without any pre-trained weights.
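A minimal sketch of this training configuration is given below, assuming an Ultralytics-style training API; the model and dataset configuration file names are hypothetical placeholders rather than artifacts released with the paper.

```python
# Training setup sketch (assumed Ultralytics-style API; config names are hypothetical).
from ultralytics import YOLO

model = YOLO("uav-yolov12-seg.yaml")   # hypothetical custom model definition
model.train(
    data="uav_roads.yaml",             # hypothetical dataset config (train/val/test splits)
    epochs=300,
    batch=32,
    optimizer="SGD",
    lr0=0.001,                         # initial learning rate
    cos_lr=True,                       # cosine learning-rate decay
    momentum=0.937,
    weight_decay=0.0005,
    pretrained=False,                  # trained from scratch, no pre-trained weights
)
```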

3.1. Ablation Experiment Results

To evaluate the effectiveness of the proposed structural improvements in UAV-YOLOv12, we conducted ablation experiments by incrementally incorporating two key components into the baseline YOLOv12 model: the SKNet attention module and the PConv module. The experimental results are listed in Table 2.
The original YOLOv12 model achieves precisions of 0.831 and 0.758 on the highway (road-H) and path (road-P) categories, respectively, with recalls of 0.832 and 0.723. This results in F1-scores of 0.831 (road-H) and 0.740 (road-P), and IoU scores of 0.833 and 0.647.
After introducing the SKNet module only (YOLOv12 1), detection performance shows a clear improvement. For road-P, precision increases from 0.758 to 0.795 (+4.9%), and recall from 0.723 to 0.784 (+8.4%), leading to a 6.6% increase in F1-score and a 9.0% improvement in IoU (from 0.647 to 0.705). These results demonstrate that SKNet effectively enhances the model’s sensitivity to spatially variable and small-scale structures, particularly in less-structured environments like paths.
When incorporating PConv only (YOLOv12 2), the model shows further improvement in both categories. Compared to the baseline, the precision on road-H increases from 0.831 to 0.868 (+4.5%), with the F1-score rising from 0.831 to 0.852. For road-P, the precision rises to 0.807 (+6.5%) and the F1-score reaches 0.793 (+7.2%), indicating PConv’s strength in suppressing background noise and enhancing response in complex textures.
The UAV-YOLOv12 model, which integrates both SKNet and PConv modules, achieves the highest overall performance. The precision reaches 0.913 and 0.845 for road-H and road-P, respectively, representing improvements of 9.9% and 11.5% over the baseline. The recall scores are also the best among all configurations (0.891 for road-H and 0.805 for road-P), leading to F1-scores of 0.902 and 0.825. The IoU scores reach 0.891 and 0.795, improving by 5.8 and 14.8 percentage points, respectively, compared to the original model.
These results confirm that both SKNet and PConv bring complementary benefits: SKNet improves spatial attention and feature selection, while PConv enhances robustness to occlusions and reduces redundant activations. The effectiveness of SKNet stems from its ability to dynamically adjust receptive fields by selecting between convolution kernels of different sizes. This mechanism allows the network to adaptively focus on the most relevant scale for a given spatial location, making it particularly effective in capturing both fine-grained road markings and broader structural features. This dynamic attention improves localization accuracy in scenes where road features vary significantly in scale or are partially occluded. On the other hand, PConv contributes by enforcing sparsity in feature activation. By masking out irrelevant or noisy regions (e.g., shadows, vegetation, or occluded areas), PConv enables the model to concentrate computation on informative content, which leads to more stable predictions under complex backgrounds. This targeted filtering also reduces overfitting to background clutter and improves generalization to diverse road types.
In combination, SKNet’s adaptive spatial awareness and PConv’s selective feature activation create a complementary synergy: SKNet enhances semantic richness across scales, while PConv ensures robustness and noise suppression at the spatial level. This dual enhancement contributes to the model’s superior overall detection performance, especially in UAV imagery where targets are often small, fragmented, and subject to environmental variability.
In addition to accuracy metrics, we evaluated the inference speed of each model on the same hardware environment. As observed in Table 2, the baseline YOLOv12 model achieves the fastest inference speed, averaging 8.3 milliseconds per image. When incorporating the SKNet attention module (YOLOv12 1), the testing time slightly increases to 9.5 ms, due to the added dynamic kernel selection operations.
Similarly, replacing standard convolution with PConv (YOLOv12 2) results in a moderate increase in testing time to 9.8 ms, attributed to the additional mask computation and conditional convolution operations. The fully enhanced UAV-YOLOv12 model, which combines both SKNet and PConv, requires 11.1 ms per image on average. Although there is a 33.7% increase in testing time compared to the baseline, the trade-off is justified by the notable gains in detection accuracy and robustness, especially in small and complex road objects. These results indicate that the proposed improvements moderately affect inference speed but significantly enhance overall performance, making UAV-YOLOv12 a practical and effective solution for UAV-based road monitoring tasks.
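Per-image inference latency of this kind is typically measured as in the sketch below, which averages GPU-synchronized timings over a test set after a short warm-up; the helper is illustrative and not part of the paper's released code.

```python
import time
import torch

@torch.no_grad()
def average_latency_ms(model, images, device="cuda", warmup=10):
    """Mean per-image inference latency in milliseconds; images is a list of CHW tensors."""
    model.eval().to(device)
    for img in images[:warmup]:                      # warm-up to stabilize CUDA kernels
        model(img.unsqueeze(0).to(device))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images:
        model(img.unsqueeze(0).to(device))
    torch.cuda.synchronize()                         # ensure all GPU work is finished
    return (time.perf_counter() - start) * 1000.0 / len(images)
```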

3.2. Experimental Results of Contrast Tests

3.2.1. Comparison of Detection Results with State-of-the-Art Models

To further validate the effectiveness of the proposed UAV-YOLOv12 model, we compared it with two widely used semantic segmentation models, U-Net and DeepLabV3+, as well as with the baseline YOLOv12. The evaluation results are summarized in Table 3, covering four key metrics (Precision, Recall, F1-score, IoU) across two road categories (road-H and road-P), along with testing time.
The traditional segmentation model U-Net achieved relatively stable performance on highway targets, with an F1-score of 0.831 and IoU of 0.837. However, its recall and IoU dropped significantly for path targets (0.715 and 0.714), reflecting limitations in handling small, irregular structures. Its average testing time per image was 14.2 ms, which is higher than detection-based models, limiting its application in real-time UAV scenarios.
DeepLabV3+ slightly outperformed U-Net in both precision and F1-score across categories, benefiting from its Atrous Spatial Pyramid Pooling. However, it still suffered from lower recall on road-H (0.827) and a relatively weak IoU on road-P (0.695). Moreover, its computational overhead was the highest among all models, with an average inference time of 21.8 ms.
In addition, two Transformer-based models were evaluated to further validate the proposed method against recent advances. Swin-UperNet, which integrates hierarchical Swin Transformer backbones with a UPerNet decoding head, achieved an F1-score of 0.846 on road-H and 0.776 on road-P, with corresponding IoU values of 0.845 and 0.701. CMTFNet, a hybrid model that combines CNN and multi-scale Transformer fusion, showed slightly better accuracy with F1-scores of 0.853 and 0.793, and IoUs of 0.852 and 0.719. However, both models suffered from significantly higher inference times, with Swin-UperNet and CMTFNet averaging 29.7 ms and 23.2 ms per image, respectively, which limits their applicability in real-time UAV deployment scenarios.
Compared with these segmentation-based methods, the baseline YOLOv12 showed noticeable advantages in both detection accuracy and speed. For road-H and road-P, YOLOv12 achieved F1-scores of 0.852 and 0.793, with IoU scores of 0.851 and 0.723, respectively. Importantly, the model maintained a much faster inference time of 9.8 ms, demonstrating better suitability for onboard UAV deployment.
The proposed UAV-YOLOv12 model significantly outperformed all baselines in all metrics. It achieved the highest F1-scores of 0.902 (road-H) and 0.825 (road-P), as well as the best IoU scores of 0.891 and 0.795. Compared with YOLOv12, this corresponds to F1-score improvements of 5.0 and 3.2 percentage points. This gain comes with only a modest increase in inference time to 11.1 ms, which still supports near real-time performance. Compared with U-Net, the improvements are even larger, at 7.1 and 9.5 percentage points.
To further illustrate the comparative performance of different models, several representative detection results are visualized in Figure 6. In each image, the blue area represents the predicted road region, the label and confidence score indicate the classification and certainty of prediction, while the red bounding boxes highlight missed detections, and the yellow boxes mark false positives.
As shown in the figure, UAV-YOLOv12 consistently produces the most accurate and complete segmentation, especially for complex scenes involving curved highways and narrow paths. Only a single minor missed region is observed in UAV-YOLOv12 outputs, with most road structures accurately delineated and confidently classified.
In contrast, the other models exhibit noticeable limitations. Both U-Net and DeepLabV3+ show substantial missed detections, particularly on small or fragmented path targets (road-P). Notably, DeepLabV3+ also produces a large false positive region (highlighted in yellow), mistakenly identifying background areas as road segments. This suggests reduced discriminative capability in low-contrast and texture-complex UAV imagery.
These visual results provide strong qualitative evidence that UAV-YOLOv12 not only improves quantitative accuracy but also achieves significantly better visual consistency, especially in recognizing narrow, occluded, or small-scale road elements in UAV-based remote sensing scenarios.

3.2.2. Comparison of Detection Results of Different Labels

To further evaluate the model’s ability to handle varying object scales and characteristics, we compared detection results for two distinct road categories: road-H (highways) and road-P (paths), as illustrated in Table 4.
The road-H category typically covers larger pixel areas with clear contours and well-defined lane markings, such as white dashed or solid lines. These structural cues provide strong visual priors for detection models, leading to highly accurate predictions. UAV-YOLOv12 achieves a detection precision of 91.3% and recall of 89.1% on road-H, with minimal missed regions, as seen in the left portion of Table 4.
In contrast, the road-P category represents smaller, less regular targets such as rural paths or narrow walking trails. These targets exhibit weaker contrast with the background and are often surrounded by vegetation, farmland, or unstructured terrain. As a result, road-P detection is inherently more challenging. Despite this, UAV-YOLOv12 still achieves a precision of 84.5% and recall of 80.5%, demonstrating robust performance even under adverse conditions.
These results highlight the model’s strong generalization capability. While road-H benefits from more distinguishable visual features, the model’s ability to maintain high accuracy on road-P confirms the effectiveness of the proposed enhancements, particularly the SKNet and PConv modules, in capturing fine-grained spatial structures and suppressing background interference.

3.3. Generalization Performance Evaluation

To assess the generalization capability of UAV-YOLOv12 across diverse environments, we conducted inference tests on four publicly available datasets: Massachusetts Roads [39], UC Merced Land-Use [40], NWPU VHR-10 [41], and Drone Vehicle [42]. These datasets cover a wide variety of aerial scenes, including urban expressways, rural roads, mixed vegetation, parking lots, and complex backgrounds. The quantitative results are reported in Table 5.
On the Massachusetts Roads dataset, which focuses on road segmentation from satellite images, UAV-YOLOv12 achieves F1-scores of 0.848 (road-H) and 0.806 (road-P), with IoU values reaching 0.845 and 0.812, respectively. Despite domain shifts in image style and resolution, the model retains strong performance, confirming the robustness of its multi-scale feature design.
In the UC Merced Land-Use Dataset, where no explicit road-P category is labeled, UAV-YOLOv12 still performs well on road-H targets, obtaining an F1-score of 0.840 and IoU of 0.842. Similarly, high scores were observed on the NWPU dataset, with an overall F1-score of 0.871 and IoU of 0.873, indicating the model’s adaptability to aerial images with various target densities and object scales.
The Drone Vehicle Dataset, which features low-altitude UAV imagery with traffic scenes and occlusions, further demonstrates the model’s capability. UAV-YOLOv12 achieves an F1-score of 0.879 and IoU of 0.881, outperforming many baseline detectors reported in previous studies.
Visual examples in Figure 7 reinforce these findings. The predicted road masks (blue regions) align well with actual road boundaries, even under challenging lighting, occlusion, and layout variations. High-confidence predictions (e.g., >0.87) are observed across diverse environments, such as expressway interchanges, rural bends, and dense traffic scenes. Notably, even on unseen datasets, the model maintains structural consistency and semantic precision. These results validate the strong generalization performance of UAV-YOLOv12. The integration of SKNet and PConv modules, combined with multi-scale fusion, contributes to enhanced transferability and makes the model suitable for real-world UAV applications beyond the original training domain.
Although the proposed model demonstrates strong generalization performance across various datasets, we also observed occasional missed detections. For instance, in the UC Merced Land-Use Dataset, some road boundaries, particularly those beneath overpasses or in occluded areas, were not detected. This could be attributed to limited visibility, weak edge features, or annotation ambiguity. In future work, we plan to incorporate occlusion-aware learning strategies and enhanced contextual modules to mitigate such issues and further improve segmentation accuracy.

4. Conclusions

In this study, we proposed UAV-YOLOv12, a multi-scale segmentation model tailored for UAV-based road infrastructure monitoring. The model integrates SKNet and PConv to enhance detection accuracy, particularly for small and irregular road targets in complex aerial scenes. Through extensive experiments, we draw the following conclusions:
(1)
The proposed UAV-YOLOv12 outperforms both traditional segmentation models (U-Net, DeepLabV3+) and the baseline YOLOv12 in all key metrics. It achieves F1-scores of 0.902 (road-H) and 0.825 (road-P), with IoUs up to 0.891 and 0.795, respectively.
(2)
The UAV-YOLOv12 still maintains high accuracy on multiple public datasets, including Massachusetts Roads and Drone Vehicle datasets, demonstrating strong generalization ability.
(3)
In addition to architectural enhancements, UAV-YOLOv12 also maintains a real-time inference speed of 11.1 ms per image, supporting deployment on onboard UAV platforms.
However, the current model still faces some challenges in extremely cluttered scenes and under poor weather conditions such as fog, rain, or low illumination. In future work, we plan to integrate temporal consistency from UAV video streams and explore lightweight backbones to further enhance performance while minimizing computational costs. Additionally, expanding annotated UAV datasets will be essential for improving recognition in diverse road scenarios.

Author Contributions

Conceptualization, Z.L. and B.C.; methodology, Z.L. and B.C.; software, B.C. and Z.L.; validation, Z.L. and B.C.; formal analysis, B.C. and Q.Y.; investigation, Z.L. and B.C.; resources, Z.L.; data curation, Z.L., B.C. and Q.Y.; writing—original draft preparation, B.C. and Z.L.; writing—review and editing, Z.L.; visualization, Z.L. and B.C.; supervision, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are not publicly available due to manufacturer restrictions. Sample images can be requested from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, J.; Yoon, Y. Indicators development to support intelligent road infrastructure in urban cities. Transp. Policy 2021, 114, 252–265. [Google Scholar] [CrossRef]
  2. Ye, Z.; Wei, Y.; Yang, S.; Li, P.; Yang, F.; Yang, B.; Wang, L. IoT-enhanced smart road infrastructure systems for comprehensive real-time monitoring. Internet Things Cyber Phys. Syst. 2024, 4, 235–249. [Google Scholar] [CrossRef]
  3. Liu, Z.; Wang, S.; Gu, X.; Dong, Q. Non-destructive testing and intelligent evaluation of road structural conditions using GPR and FWD. J. Traffic Transp. Eng. 2025, 12, 462–476. [Google Scholar] [CrossRef]
  4. Barbieri, D.M.; Lou, B. Instrumentation and testing for road condition monitoring—A state-of-the-art review. NDT E Int. 2024, 146, 103161. [Google Scholar] [CrossRef]
  5. Zhang, A.A.; Shang, J.; Li, B.; Hui, B.; Gong, H.; Li, L.; Zhan, Y.; Ai, C.; Niu, H.; Chu, X.; et al. Intelligent pavement condition survey: Overview of current researches and practices. J. Road Eng. 2024, 4, 257–281. [Google Scholar] [CrossRef]
  6. Xu, H.; Wang, L.; Han, W.; Yang, Y.; Li, J.; Lu, Y.; Li, J. A Survey on UAV Applications in Smart City Management: Challenges, Advances, and Opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8982–9010. [Google Scholar] [CrossRef]
  7. Tilon, S.; Nex, F.; Vosselman, G.; Sevilla de la Llave, I.; Kerle, N. Towards Improved Unmanned Aerial Vehicle Edge Intelligence: A Road Infrastructure Monitoring Case Study. Remote. Sens. 2022, 14, 4008. [Google Scholar] [CrossRef]
  8. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the Unmanned Aerial Vehicles (UAVs): A Comprehensive Review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
  9. Kujawski, A.; Dudek, T. Analysis and visualization of data obtained from camera mounted on unmanned aerial vehicle used in areas of urban transport. Sustain. Cities Soc. 2021, 72, 103004. [Google Scholar] [CrossRef]
  10. Biçici, S.; Zeybek, M. An approach for the automated extraction of road surface distress from a UAV-derived point cloud. Autom. Constr. 2021, 122, 103475. [Google Scholar] [CrossRef]
  11. Wang, M.; Wen, Y.; Wei, S.; Hu, J.; Yi, W.; Shi, J. SA-ISAR Imaging via Detail Enhancement Operator and Adaptive Threshold Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5205914. [Google Scholar] [CrossRef]
  12. Xiao, F.; Tong, L.; Luo, S. A Method for Road Network Extraction from High-Resolution SAR Imagery Using Direction Grouping and Curve Fitting. Remote Sens. 2019, 11, 2733. [Google Scholar] [CrossRef]
  13. Zeng, T.; Gao, Q.; Ding, Z.; Chen, J.; Li, G. Road Network Extraction from Low-Contrast SAR Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 907–911. [Google Scholar] [CrossRef]
  14. Lin, H.; Hong, D.; Ge, S.; Luo, C.; Jiang, K.; Jin, H.; Wen, C. RS-MoE: A Vision–Language Model with Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5614918. [Google Scholar] [CrossRef]
  15. Tuia, D.; Schindler, K.; Demir, B.; Zhu, X.X.; Kochupillai, M.; Džeroski, S.; van Rijn, J.N.; Hoos, H.H.; Frate, F.D.; Datcu, M.; et al. Artificial Intelligence to Advance Earth Observation: A review of models, recent trends, and pathways forward. IEEE Geosci. Remote Sens. Mag. 2024, 2–25. [Google Scholar] [CrossRef]
  16. Silva, L.A.; Leithardt, V.R.Q.; Batista, V.F.L.; González, G.V.; Santana, J.F.D.P. Automated Road Damage Detection Using UAV Images and Deep Learning Techniques. IEEE Access 2023, 11, 62918–62931. [Google Scholar] [CrossRef]
  17. Datta, S.; Durairaj, S. Review of Deep Learning Algorithms for Urban Remote Sensing Using Unmanned Aerial Vehicles (UAVs). Recent Adv. Comput. Sci. Commun. 2024, 17, 66–77. [Google Scholar] [CrossRef]
  18. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  19. Maxwell, A.E.; Warner, T.A.; Guillén, L.A. Accuracy Assessment in Convolutional Neural Network-Based Deep Learning Remote Sensing Studies—Part 1: Literature Review. Remote Sens. 2021, 13, 2450. [Google Scholar] [CrossRef]
  20. Singh, P.P.; Garg, R.D. A two-stage framework for road extraction from high-resolution satellite images by using prominent features of impervious surfaces. Int. J. Remote Sens. 2014, 35, 8074–8107. [Google Scholar] [CrossRef]
  21. Khan, M.J.; Singh, P.P.; Pradhan, B.; Alamri, A.; Lee, C.W. Extraction of Roads Using the Archimedes Tuning Process with the Quantum Dilated Convolutional Neural Network. Sensors 2023, 23, 8783. [Google Scholar] [CrossRef]
  22. Zhao, W.; Li, M.; Wu, C.; Zhou, W.; Chu, G. Identifying Urban Functional Regions from High-Resolution Satellite Images Using a Context-Aware Segmentation Network. Remote Sens. 2022, 14, 3996. [Google Scholar] [CrossRef]
  23. Sun, Z.; Zhou, W.; Ding, C.; Xia, M. Multi-Resolution Transformer Network for Building and Road Segmentation of Remote Sensing Image. ISPRS Int. J. Geo-Inf. 2022, 11, 165. [Google Scholar] [CrossRef]
  24. Xiao, R.; Wang, Y.; Tao, C. Fine-Grained Road Scene Understanding from Aerial Images Based on Semisupervised Semantic Segmentation Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3001705. [Google Scholar] [CrossRef]
  25. Behera, T.K.; Bakshi, S.; Sa, P.K.; Nappi, M.; Castiglione, A.; Vijayakumar, P.; Gupta, B.B. The NITRDrone Dataset to Address the Challenges for Road Extraction from Aerial Images. J. Signal Process. Syst. 2023, 95, 197–209. [Google Scholar] [CrossRef]
  26. Sultonov, F.; Park, J.H.; Yun, S.; Lim, D.W.; Kang, J.M. Mixer U-Net: An Improved Automatic Road Extraction from UAV Imagery. Appl. Sci. 2022, 12, 1953. [Google Scholar] [CrossRef]
  27. Mahmud, M.N.; Osman, M.K.; Ismail, A.P.; Ahmad, F.; Ahmad, K.A.; Ibrahim, A. Road Image Segmentation using Unmanned Aerial Vehicle Images and DeepLab V3+ Semantic Segmentation Model. In Proceedings of the 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 27–28 August 2021; pp. 176–181. [Google Scholar]
  28. Mahara, A.; Khan, M.R.K.; Deng, L.; Rishe, N.; Wang, W.; Sadjadi, S.M. Automated Road Extraction from Satellite Imagery Integrating Dense Depthwise Dilated Separable Spatial Pyramid Pooling with DeepLabV3+. Appl. Sci. 2025, 15, 1027. [Google Scholar] [CrossRef]
  29. Wei, X.; Li, Z.; Wang, Y. SED-YOLO based multi-scale attention for small object detection in remote sensing. Sci. Rep. 2025, 15, 3125. [Google Scholar] [CrossRef]
  30. Qiu, M.; Huang, L.; Tang, B.H. ASFF-YOLOv5: Multielement Detection Method for Road Traffic in UAV Images Based on Multiscale Feature Fusion. Remote Sens. 2022, 14, 3498. [Google Scholar] [CrossRef]
  31. Zhao, Z.; He, P. YOLO-U: Multi-task model for vehicle detection and road segmentation in UAV aerial imagery. Earth Sci. Inform. 2024, 17, 3253–3269. [Google Scholar] [CrossRef]
  32. Walambe, R.; Marathe, A.; Kotecha, K. Multiscale Object Detection from Drone Imagery Using Ensemble Transfer Learning. Drones 2021, 5, 66. [Google Scholar] [CrossRef]
  33. Wang, W.; Yang, N.; Zhang, Y.; Wang, F.; Cao, T.; Eklund, P. A review of road extraction from remote sensing images. J. Traffic Transp. Eng. 2016, 3, 271–282. [Google Scholar] [CrossRef]
  34. Liu, Z.; Wu, W.; Wang, D.; Cui, B.; Gu, X. Automatic extraction and 3D modeling of real road scenes using UAV imagery and deep learning semantic segmentation. Int. J. Digit. Earth 2024, 17, 2365970. [Google Scholar] [CrossRef]
  35. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  36. Ge, T.; Ning, B.; Xie, Y. YOLO-AFR: An Improved YOLOv12-Based Model for Accurate and Real-Time Dangerous Driving Behavior Detection. Appl. Sci. 2025, 15, 6090. [Google Scholar] [CrossRef]
  37. Liu, Z.; Wang, S.; Gu, X.; Wang, D.; Dong, Q.; Cui, B. Intelligent Assessment of Pavement Structural Conditions: A Novel FeMViT Classification Network for GPR Images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13511–13523. [Google Scholar] [CrossRef]
  38. Liu, Q.; Cui, B.; Liu, Z. Air Quality Class Prediction Using Machine Learning Methods Based on Monitoring Data and Secondary Modeling. Atmosphere 2024, 15, 553. [Google Scholar] [CrossRef]
  39. Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto (Canada): Toronto, ON, Canada, 2013. [Google Scholar]
  40. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  41. Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-quality instance segmentation for remote sensing imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
  42. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
Figure 1. The framework of UAV-YOLOv12 for automated road segmentation.
Figure 2. The network of original YOLOv12-N model for road segmentation. (Note: The bottleneck module refers to a compact convolutional block that reduces the number of parameters and computational cost by first compressing and then expanding the feature dimensions, commonly used to improve efficiency without significantly sacrificing performance).
Figure 3. The network of UAV-YOLOv12 model for road segmentation.
Figure 4. The network of PConv module for road extraction. The input is first processed with a dynamic binary mask to identify valid regions. Convolution filters are applied only to valid areas (as shown by the * symbol indicating element-wise convolution). The “PConv” arrow represents the transformation process where partial convolution is applied to extract features while skipping masked regions.
Figure 5. The network of SKNet module for road extraction.
Figure 6. Visual comparison of detection results from different models on UAV road imagery.
Figure 7. Detection results of UAV-YOLOv12 on multiple public aerial datasets.
Table 1. Road dataset information in UAV images.
Dataset | Labeled | Road-H | Road-P | Unlabeled | Total
Training set | 1038 | 1236 | 675 | 162 | 1200
Validation set | 346 | 476 | 187 | 54 | 400
Testing set | 346 | 523 | 221 | 54 | 400
Table 2. Ablation study results of YOLOv12 variants in terms of accuracy metrics and inference time.
Models | P (Road-H) | P (Road-P) | R (Road-H) | R (Road-P) | F1 (Road-H) | F1 (Road-P) | IoU (Road-H) | IoU (Road-P) | Testing Time (ms)
YOLOv12 | 0.831 | 0.758 | 0.832 | 0.723 | 0.831 | 0.740 | 0.833 | 0.647 | 8.3
YOLOv12 1 | 0.853 | 0.795 | 0.834 | 0.784 | 0.843 | 0.789 | 0.849 | 0.705 | 9.5
YOLOv12 2 | 0.868 | 0.807 | 0.837 | 0.779 | 0.852 | 0.793 | 0.851 | 0.723 | 9.8
UAV-YOLOv12 | 0.913 | 0.845 | 0.891 | 0.805 | 0.902 | 0.825 | 0.891 | 0.795 | 11.1
(Note: YOLOv12 1 represents the baseline YOLOv12 model with the SKNet module; YOLOv12 2 represents the baseline YOLOv12 model with the PConv module).
Table 3. Detection results of UAV-YOLOv12 and state-of-the-art models.
Models | P (Road-H) | P (Road-P) | R (Road-H) | R (Road-P) | F1 (Road-H) | F1 (Road-P) | IoU (Road-H) | IoU (Road-P) | Testing Time (ms)
U-Net | 0.831 | 0.745 | 0.832 | 0.715 | 0.831 | 0.730 | 0.837 | 0.714 | 14.2
DeepLabV3+ | 0.861 | 0.774 | 0.827 | 0.758 | 0.844 | 0.766 | 0.848 | 0.695 | 21.8
Swin-UperNet | 0.863 | 0.785 | 0.829 | 0.768 | 0.846 | 0.776 | 0.845 | 0.701 | 29.7
CMTFNet | 0.872 | 0.811 | 0.835 | 0.775 | 0.853 | 0.793 | 0.852 | 0.719 | 23.2
YOLOv12 | 0.868 | 0.807 | 0.837 | 0.779 | 0.852 | 0.793 | 0.851 | 0.723 | 9.8
UAV-YOLOv12 | 0.913 | 0.845 | 0.891 | 0.805 | 0.902 | 0.825 | 0.891 | 0.795 | 11.1
Table 4. Visual comparison of detection results for different labels using UAV-YOLOv12.
Image | Labels (Road-H) | Labels (Road-P)
Input images | Drones 09 00533 i001, Drones 09 00533 i002 | Drones 09 00533 i003, Drones 09 00533 i004
Output images | Drones 09 00533 i005, Drones 09 00533 i006 | Drones 09 00533 i007, Drones 09 00533 i008
Table 5. Classification results of UAV-YOLOv12 on other datasets.
Dataset | P (Road-H) | P (Road-P) | R (Road-H) | R (Road-P) | F1 (Road-H) | F1 (Road-P) | IoU (Road-H) | IoU (Road-P) | Overall F1 | Overall IoU
Massachusetts Roads | 0.854 | 0.817 | 0.842 | 0.795 | 0.848 | 0.806 | 0.845 | 0.812 | 0.826 | 0.823
UC Merced Land-Use Dataset | 0.843 | - | 0.837 | - | 0.840 | - | 0.842 | - | 0.840 | 0.842
NWPU Dataset | 0.875 | - | 0.867 | - | 0.871 | - | 0.873 | - | 0.871 | 0.873
Drone Vehicle Dataset | 0.873 | - | 0.886 | - | 0.879 | - | 0.881 | - | 0.879 | 0.881

Share and Cite

MDPI and ACS Style

Cui, B.; Liu, Z.; Yang, Q. UAV-YOLO12: A Multi-Scale Road Segmentation Model for UAV Remote Sensing Imagery. Drones 2025, 9, 533. https://doi.org/10.3390/drones9080533

