1. Introduction
The rapid progression of unmanned aerial vehicle (UAV) technologies and associated mobile communication systems has emerged as a transformative development in modern sensing applications [1]. These systems are particularly valued for their operational flexibility, extended coverage capabilities, and reliable imaging performance, enabling diverse implementations across critical domains including wildfire management, intelligent transportation systems (ITS), aquatic environmental monitoring, and urban infrastructure assessment [2,3]. The exceptional maneuverability and unobstructed aerial perspective of UAV platforms render them particularly effective for situational awareness enhancement in both routine operations and emergency response scenarios. Specific applications demonstrate their versatility, as follows: in wildfire scenarios, UAVs facilitate access to otherwise inaccessible areas for improved fire surveillance and damage evaluation [4]; in transportation networks, they enable comprehensive traffic monitoring for optimized flow management [5]; in environmental protection, hyperspectral-equipped UAVs provide vital water quality assessment through chromatic anomaly detection [6]; while in urban contexts, they support critical infrastructure inspection and regulatory enforcement.
The technical requirements for highway perception systems demand particularly precise lane marking segmentation and extraction methodologies, as these processes fundamentally enable traffic violation identification, flow analysis, and driving behavior assessment [7,8,9]. Beyond immediate monitoring applications, these capabilities prove essential for generating high-definition lane-level maps [10,11]—a critical component for emerging autonomous navigation systems, dynamic map updating protocols, and advanced route planning algorithms. Enhanced lane marking segmentation directly contributes to vehicle localization accuracy, thereby improving decision-making reliability in autonomous driving systems. Furthermore, the analytical potential of robust lane extraction extends to macroscopic traffic optimization through lane utilization pattern analysis, offering substantive benefits for congestion mitigation and infrastructure efficiency [12,13,14,15,16]. From a regulatory perspective, automated lane marking detection provides quantifiable evidence for traffic regulation compliance, significantly enhancing enforcement capabilities and road safety metrics.
Currently, research pertaining to lane detection and segmentation predominantly centers on autonomous driving applications, wherein image data are typically acquired from a ground-level, horizontal perspective [17,18,19,20,21]. In contrast, studies addressing lane feature extraction from UAV viewpoints remain considerably limited [22,23]. The aerial perspective introduces several fundamental differences: UAV imaging captures comprehensive lane topologies including complex configurations (merges, diverges, and interchanges) across the entire roadway width, whereas ground-level systems typically observe only the immediately adjacent lanes with relatively linear geometries. Furthermore, the operational focus of terrestrial systems on immediate navigation needs creates significant methodological incompatibilities for aerial data processing. These differences manifest as unique technical challenges including increased lane pattern variability, perspective-induced geometric distortions, and illumination artifacts characteristic of aerial acquisition—all demanding specialized algorithmic solutions.
Traditional lane detection approaches relying on handcrafted feature extraction (texture, chromatic, or edge-based) demonstrate limited robustness when confronted with aerial imaging challenges such as occlusions, structural complexities, or adverse environmental conditions [24,25,26,27,28]. While these methods maintain utility in constrained scenarios, their performance degrades markedly in operationally relevant environments featuring complex junction topologies, dense traffic conditions, overhead structure interference, and multi-lane configurations. Furthermore, these conventional methods scale poorly to multi-lane configurations, owing to inherent limits in their generalizability across diverse lane topologies.
To address these shortcomings, recent advancements have shifted toward deep learning paradigms that autonomously derive discriminative features through large-scale data-driven learning, thereby mitigating the dependency on manual feature engineering while achieving enhanced performance in challenging environments. He et al. [22] introduced a pipeline for automatically extracting lane-level street maps from aerial imagery. Their method operates in two main stages: lane and direction extraction in non-intersection areas using a segmentation model, followed by enumeration and connectivity validation of possible turning lanes at intersections via a classifier. This work primarily addresses urban road intersections, presenting a context distinct from highway scenarios. Similarly, Yao et al. [29] developed a two-stage deep learning framework for constructing lane-level maps from aerial images. The first stage produces lane segmentation and vertex heatmaps, while the second refines the results through vertex matching to generate topologically structured lane polylines. In another study, Azimi et al. [23] presented an approach based on wavelet-enhanced, cost-sensitive, symmetric fully convolutional neural networks to achieve accurate lane-marking segmentation from high-resolution aerial imagery. Additionally, Franz et al. [30] proposed a method for automatic segmentation and 3D reconstruction of road markings from multi-view aerial images, combining an enhanced fully convolutional network for pixel-wise segmentation with a sliding-window least-squares optimization technique for precise 3D line-feature reconstruction. However, it should be noted that both studies [23,30] focus exclusively on road marking extraction—such as lane lines, stop lines, and symbols—and do not accomplish full lane-level segmentation. Although lane extraction from drone-captured aerial imagery has attracted some research interest, it remains underexplored compared to analogous studies in the autonomous driving domain. Furthermore, existing efforts often employ inconsistent evaluation metrics and heterogeneous datasets. Future work should therefore prioritize enhancing the robustness and accuracy of lane detection algorithms applied to drone-acquired aerial imagery in complex environments.
Our methodological framework addresses these limitations through a multi-stage processing pipeline: initial deep learning-based semantic segmentation generates precise road and lane marking extractions, with subsequent geometric processing employing constrained polynomial fitting for outlier-resistant lane modeling. The system specifically incorporates connected component analysis for lane segment isolation, iterative classification for fragmented marking association, and context-aware lane topology reconstruction. This comprehensive approach demonstrates particular effectiveness in handling real-world complexities including solid/dashed marking patterns and segmentation inconsistencies, ultimately producing accurate lane-level roadway models suitable for advanced transportation applications.
2. Related Work
2.1. Road and Lane Marking Segmentation Model
Conventional approaches to lane marking extraction predominantly depend on hand-engineered feature representations, including histogram distributions, color space transformations, and texture descriptors, typically integrated with edge detection operators and linear feature extraction techniques for lane segment identification. Recent advances in computer vision have witnessed a paradigm shift toward deep learning methodologies, with convolutional neural networks demonstrating particular efficacy in both object detection and semantic segmentation applications for lane analysis. Whereas object detection frameworks localize lane segments through axis-aligned bounding box regression, semantic segmentation architectures achieve superior precision through dense pixel-wise classification, effectively addressing the geometric limitations imposed by perspective distortion in oblique aerial imagery. This pixel-accurate segmentation capability proves particularly advantageous for lane marking extraction, as the non-rectilinear geometry of projected lane segments in UAV-acquired imagery frequently violates the rectangular bounding box assumption fundamental to conventional object detection approaches.
A variety of network architectures have been developed for semantic segmentation. In this study, we provide a concise comparison among STDC (Short-Term Dense Concatenation) [31], BiSeNet [32], ESPNetV2 [33], and D-LinkNet [34], ultimately selecting the STDC network for our task. The STDC architecture incorporates an innovative Short-Term Dense Concatenation backbone that preserves rich multi-scale features while progressively reducing computational channels, leading to a significant improvement in computational efficiency. Its dual-branch design—comprising a spatial path and a context path—inherits the strengths of BiSeNet, effectively combining high-resolution spatial details with deep semantic information to ensure high segmentation accuracy. Compared to D-LinkNet, STDC is lighter and faster; relative to BiSeNet, it employs a more efficient custom backbone; and when compared to ESPNetV2, it achieves superior accuracy while maintaining comparable inference speed. Consequently, STDC presents an ideal solution for real-time semantic segmentation applications that require high accuracy.
Our semantic segmentation architecture, illustrated in Figure 1, builds upon the STDC network framework. The model employs a hierarchical backbone network that progressively reduces spatial resolution through five consecutive downsampling stages, each halving the feature map dimensions while expanding channel capacity. Critical intermediate features extracted from the third (feat3), fourth (feat4), and fifth (feat5) stages undergo specialized processing: the deepest features (feat5) first undergo global context aggregation through global average pooling (GAP) to produce the compressed representation "avg". Concurrently, multi-scale features from stages four and five (feat4, feat5) are refined through channel-wise attention mechanisms, generating enhanced representations feat4arm and feat5arm, respectively.
The network then implements a cascaded feature integration strategy, where the global context vector “avg” is combined with the attended high-level features (feat5arm) and upsampled 2× to produce feat5up. This intermediate representation is fused with the attended mid-level features (feat4arm) through element-wise summation, yielding the composite feature feat4fuse. The architecture further integrates lower-level details by combining feat4fuse with the stage three features (feat3) through a dedicated feature fusion module (FFM), which employs both spatial and channel attention to optimally combine multi-scale information. The resultant fused feature map (ffm) undergoes 8× bilinear upsampling to restore original input resolution before final pixel-wise classification. This carefully designed feature hierarchy and fusion strategy enables the STDC network to maintain segmentation precision while operating at real-time speeds.
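For concreteness, the sketch below shows how this fusion pathway can be assembled in PyTorch. It is a minimal illustration rather than the exact STDC implementation: the FFM is reduced to concatenation plus a 1 × 1 convolution, the attention modules are simplified, and the channel sizes (256/512/1024 inputs, 128 internal) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k=3):
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ARM(nn.Module):
    """Attention refinement: 3x3 projection followed by a channel attention gate."""
    def __init__(self, cin, cout):
        super().__init__()
        self.proj = conv_bn_relu(cin, cout)
        self.gate = nn.Sequential(nn.Conv2d(cout, cout, 1, bias=False), nn.Sigmoid())

    def forward(self, x):
        x = self.proj(x)
        return x * self.gate(F.adaptive_avg_pool2d(x, 1))  # channel re-weighting

class StdcDecoder(nn.Module):
    """Illustrative fusion head following Figure 1 (channel sizes are assumptions)."""
    def __init__(self, c3=256, c4=512, c5=1024, cmid=128, n_classes=3):
        super().__init__()
        self.avg_proj = conv_bn_relu(c5, cmid, k=1)        # compress GAP context
        self.arm4, self.arm5 = ARM(c4, cmid), ARM(c5, cmid)
        self.ffm = conv_bn_relu(cmid + c3, cmid, k=1)      # simplified FFM
        self.cls = nn.Conv2d(cmid, n_classes, 1)

    def forward(self, feat3, feat4, feat5):
        avg = self.avg_proj(F.adaptive_avg_pool2d(feat5, 1))         # "avg"
        feat5up = F.interpolate(self.arm5(feat5) + avg, scale_factor=2,
                                mode='bilinear', align_corners=False)  # feat5up
        feat4fuse = feat5up + self.arm4(feat4)                       # summation
        feat4fuse = F.interpolate(feat4fuse, scale_factor=2,
                                  mode='bilinear', align_corners=False)
        ffm = self.ffm(torch.cat([feat4fuse, feat3], dim=1))         # fuse detail
        return F.interpolate(self.cls(ffm), scale_factor=8,
                             mode='bilinear', align_corners=False)   # 8x upsample
```

Feeding backbone features of shapes (B, 256, H/8, W/8), (B, 512, H/16, W/16), and (B, 1024, H/32, W/32) yields per-pixel class logits at the input resolution. Note that one extra 2× upsampling is inserted before the FFM so that feat4fuse matches the stage-3 resolution, a detail implied but not stated in the text above.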
2.2. Definitions Related to Lane Markings
We aim to generate a comprehensive lane distribution map that partitions the roadway into distinct yet topologically connected regions through precise lane marking segmentation. The fundamental geometric elements comprising our framework—including points, lines, and surfaces—require clear definition to avoid ambiguity. For conceptual clarification, we present in Figure 2 a binary segmentation mask derived from UAV imagery, where lane segments are represented as black regions against a white background denoting non-lane areas. With reference to Figure 2, we formally define the key concepts related to lane segmentation:
Lane Segment: Defined as a topologically closed, connected component representing a continuous physical lane marking segment in the image domain. As exemplified in Figure 2 (segments 1–10), these manifest as either solid boundary demarcations (e.g., segments 1, 5, 6, 10) or intermittent dashed markings. While projective geometry dictates these segments should theoretically form perspective-transformed quadrilaterals, practical segmentation artifacts typically introduce non-linear boundary perturbations.
Skeleton Line: Derived through medial axis transformation of each lane segment (e.g., line AB in segment 3), serving as a reduced-dimensional representation that preserves the essential linear characteristics of the original region. Under ideal conditions, these skeletal representations would exhibit perfect linearity; however, empirical observations reveal measurable deviations attributable to segmentation noise and environmental factors.
Lane Marking: Constructed through topological concatenation of all skeleton lines belonging to a continuous physical lane marking. For instance, segments 2, 3, and 4 in Figure 2 collectively constitute lane marking Line_a (represented by polyline CD). Each skeleton line maintains bijective correspondence with exactly one lane marking entity, with centerline markings typically comprising multiple discrete skeletal segments while boundary markings often form continuous linear features.
Lane Region: Defined as the drivable surface area delimited by adjacent lane markings (e.g., the region bounded by Line_a and Line_b). This constitutes our primary extraction target, with the known geometric properties of standardized highway lanes (particularly their uniform width characteristics) serving as critical validation metrics during the reconstruction process.
Lane Completeness: Enforced through strict topological constraints requiring both correct classification of all lane segments to their respective markings (e.g., proper association of segments 2–4 to Line_a) and maintenance of full image-spanning continuity (with terminal points intersecting image boundaries). This completeness criterion ensures robust reconstruction of the complete lane network through proper topological connections between adjacent markings.
2.3. Dataset Description
Our dataset comprises aerial imagery captured by unmanned aerial vehicles (UAVs) along highway corridors, with flight altitudes maintained between 50 and 150 m to ensure optimal lane coverage and centering within the image frame. The UAV flight trajectory was carefully aligned with the highway centerline to maintain consistent perspective geometry. As illustrated in Figure 3, all images were captured from videos acquired using high-resolution cameras (1920 × 1080 pixels or higher), with capture angles constrained to less than 45 degrees from either the forward-looking or nadir (vertical downward) directions. The experimental setup utilized a DJI Matrice 300 UAV equipped with a Zenmuse H20 imaging system mounted on an integrated gimbal. Video data were captured at a frame rate of 30 fps. The dataset consists of 1128 images in total, of which 10% were randomly selected as the validation set, with the remainder used as the training set.
The dataset annotations, as demonstrated in Figure 3, include two primary classes: (1) lane markings (encompassing both solid and dashed delineators, while explicitly excluding directional arrows and diversion markings) and (2) drivable road regions. The left panel of Figure 3 presents a representative raw image, while the right panel displays its corresponding annotated counterpart, where lane markings and road areas are precisely delineated for subsequent analysis.
3. Methodology
The overall workflow of our proposed algorithm is illustrated in Figure 4. The input semantic segmentation mask, produced by the STDC network (Figure 1), is subjected to connectivity analysis to isolate distinct lane segments and extract their skeletal structures. These resulting skeletons are then approximated using polynomial fitting. Based on the positional relationships between these skeleton lines, we classify them into co-linear groups and ultimately divide the road into distinct lane regions.
3.1. Connectivity Analysis and Skeleton Line Extraction
The semantic segmentation mask treats lane markings as foreground and all other regions as background. Assuming the extracted lane markings are solid, each marking's outline can be represented as a closed contour. However, segmentation errors may introduce discontinuities into inherently continuous lane markings; to bridge these gaps, we apply morphological dilation to the mask.
For precise geometric characterization, we employ OpenCV’s findContours algorithm to extract the boundary contours of each lane segment. This computational geometry approach operates by (1) detecting connected components within the binary image space, and (2) tracing continuous pixel sequences along intensity transitions to construct closed polygonal approximations of each contour.
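A minimal OpenCV sketch of this pre-processing step is given below; the file name and the 5 × 5 dilation kernel are assumptions, and cv2.findContours performs both the connected-component detection and the boundary tracing described above.

```python
import cv2
import numpy as np

# Load a binary lane-marking mask (file name is illustrative).
mask = cv2.imread('lane_mask.png', cv2.IMREAD_GRAYSCALE)
mask = (mask > 0).astype(np.uint8) * 255

# Morphological dilation bridges small gaps caused by segmentation errors.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
mask = cv2.dilate(mask, kernel, iterations=1)

# Extract one closed boundary polygon per connected lane segment.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
segments = [c.reshape(-1, 2) for c in contours]  # (x, y) vertex lists
```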
Recognizing the fundamental continuity of lane structures, we subsequently derive medial axis representations (skeleton lines) for each segmented region. These one-dimensional descriptors, formally defined as ordered point sets {p1, p2,..., pn} ∈ ℝ2, provide a compact topological representation that preserves the essential geometric characteristics while converting the two-dimensional region into its dimensional-reduced form.
Lane segments exhibit characteristic quasi-linear, strip-like spatial distributions in the image plane. The extracted skeleton line inherently partitions each segment contour into two distinct boundary components—for vertically oriented segments, these are denoted as the left boundary, Bl, and the right boundary, Br. To optimize computational efficiency during skeleton extraction, we adopt a simplified representation in which each skeleton point is the row-wise midpoint of the two boundaries:

x = (xl + xr)/2        (1)

where x and y represent horizontal and vertical coordinates in the image plane, and xl and xr denote the abscissas of the Bl and Br boundary points in row y. As demonstrated in Figure 5, this approximated skeleton extraction method effectively preserves the geometric properties of lane segments while significantly reducing computational overhead.
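Under this simplification, the skeleton reduces to a single row-wise scan over each segment's mask. The sketch below assumes a per-segment binary mask (seg_mask) as input; names are illustrative.

```python
import numpy as np

def skeleton_from_segment(seg_mask: np.ndarray) -> np.ndarray:
    """Approximate skeleton: per image row, midpoint of the left/right
    boundary columns, following Equation (1)."""
    pts = []
    for y in range(seg_mask.shape[0]):
        xs = np.flatnonzero(seg_mask[y])        # foreground columns in this row
        if xs.size:                             # xl = xs[0], xr = xs[-1]
            pts.append(((xs[0] + xs[-1]) / 2.0, float(y)))
    return np.asarray(pts)                      # ordered (x, y) point set
```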
3.2. Preliminary Fitting and Classification of Lane Skeleton Lines
Lane markings follow characteristic smooth, quasi-linear patterns. Here, we use parabolic or spline models to fit lane markings, representing them with quadratic or cubic polynomial parameters. Lane marking fitting thus reduces to solving for these polynomial parameters.
Figure 6a shows the lane marking mask obtained through semantic segmentation. After extracting skeleton lines for all lane segments, we sort them by length in the Y-direction. Starting with the longest skeleton line, we fit all skeleton lines using a quadratic function:

x = c0 + c1·y + c2·y²        (2)

Here, the points (x, y) on the skeleton line are substituted into Equation (2) to form a system of equations. The least-squares solution for (c0, c1, c2) yields the lane marking's fitted parameters.
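In practice, the least-squares solution of Equation (2) is a one-line polynomial fit; a sketch with x modeled as a function of y, as in Equation (2):

```python
import numpy as np

def fit_quadratic(skeleton: np.ndarray) -> np.ndarray:
    """Least-squares fit of x = c0 + c1*y + c2*y^2 (Equation (2))."""
    x, y = skeleton[:, 0], skeleton[:, 1]
    c2, c1, c0 = np.polyfit(y, x, deg=2)   # polyfit returns highest order first
    return np.array([c0, c1, c2])
```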
Next, we classify lane markings. Let the parameters of the longest skeleton line be c0, c1, and c2. These parameters are used to fit the remaining skeleton lines, with the fitting error for the jth skeleton line denoted as ej:

ej = (1/N) Σi |xij − (c0 + c1·yij + c2·yij²)|        (3)

where (xij, yij) represents the coordinates of the ith point on the jth skeleton line, N is the total number of points, and ej is the average fitting error for that skeleton line. If ej is below a threshold (set to 20% of the average lane width), the skeleton line is classified as belonging to the current lane marking; otherwise, it is excluded. This process repeats until all skeleton lines are classified.
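The iterative fit-and-classify loop can be sketched as follows, reusing fit_quadratic from above; width denotes the average lane width in pixels, and the greedy longest-first strategy mirrors the procedure just described.

```python
def classify_skeletons(skeletons, width, err_thresh_ratio=0.2):
    """Greedy grouping: the longest unassigned skeleton seeds a lane marking;
    remaining skeletons join it when their mean residual (Eq. (3)) is small."""
    remaining = sorted(skeletons, key=lambda s: np.ptp(s[:, 1]), reverse=True)
    lane_markings = []
    while remaining:
        seed = remaining.pop(0)
        c0, c1, c2 = fit_quadratic(seed)
        group, rest = [seed], []
        for s in remaining:
            pred = c0 + c1 * s[:, 1] + c2 * s[:, 1] ** 2
            e_j = np.mean(np.abs(s[:, 0] - pred))   # average fitting error
            (group if e_j < err_thresh_ratio * width else rest).append(s)
        remaining = rest
        lane_markings.append(group)                 # one entry of laneSet0
    return lane_markings
```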
Each skeletal line either independently forms a complete lane marking or belongs to exactly one lane marking, with strict exclusivity in the fitting process. Once a skeletal line has been successfully incorporated into a particular lane marking during the fitting procedure, it is excluded from subsequent fitting iterations. As illustrated in Figure 6b, which demonstrates the iterative fitting and classification results, we observe three distinct patterns: (1) longer, continuous skeletal lines (e.g., skeletal lines 0 and 1) independently constitute complete lane markings; (2) certain lane markings (e.g., 6 and 13) are composed of multiple shorter skeletal segments; and (3) segmentation artifacts cause some short skeletal lines (e.g., 8, 9, and 10) to become erroneously detached from their original lane markings, resulting in spurious independent markings. The initial fitting results, designated as the preliminary lane marking set (laneSet0), can be formally represented as shown in Table 1.
3.3. Lane Marking Optimization
3.3.1. Removal of Isolated Lane Markings
In practical implementations, semantic segmentation outputs may erroneously classify non-lane objects with similar visual characteristics (e.g., utility poles or guardrails exhibiting high luminance values) as valid lane markings. To address this false positive detection issue, we implement a geometric validation step that systematically examines all candidate lane markings in the detection set. Specifically, we eliminate spurious detections when a lane marking simultaneously satisfies the following conditions regarding its vertical extent and curvature characteristics: (1) the normalized vertical span Ylen/iH falls below a predetermined threshold (as specified in Equation (4)), and (2) the second-order polynomial coefficient c2, which quantitatively characterizes the curvature of the fitted lane model, exceeds a critical value.
The parameter c2 serves as a direct metric for lane curvature: c2 = 0 indicates perfect linearity (zero curvature), and larger values of |c2| correspond to increasingly curved geometries. This validation criterion effectively filters out short, highly curved segments that are statistically unlikely to represent genuine lane markings in highway environments. Formally, the elimination condition can be expressed as:

(Ylen/iH < τ1) and (|c2| > τ2)        (4)

where τ1 and τ2 represent empirically determined thresholds for vertical extent and curvature, respectively. The parameter τ1 filters out excessively short lane markings, whereas τ2 detects curved lane skeleton lines. Objects such as vehicles or road signs on highways can be misclassified as lane markings by the semantic segmentation network; their segmentation masks often form compact patches instead of elongated structures, leading to a high quadratic coefficient c2 after polynomial fitting of their skeletal representations. These two criteria are combined to effectively remove such false positive lane segments. In our implementation, the threshold τ1 for Ylen/iH is set to 0.1, and τ2 is set to 1.0 × 10−4.
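A direct transcription of Equation (4), applied to each candidate marking (a marking is represented as a list of skeleton-point arrays, as in the earlier sketches; names are illustrative):

```python
def is_spurious(marking, iH, tau1=0.1, tau2=1e-4):
    """Flag short, highly curved candidates for removal (Equation (4))."""
    ys = np.concatenate([s[:, 1] for s in marking])
    y_len = ys.max() - ys.min()                    # vertical extent Ylen
    c0, c1, c2 = fit_quadratic(np.vstack(marking))
    return (y_len / iH < tau1) and (abs(c2) > tau2)
```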
3.3.2. Merging of Incomplete Lane Markings
The proposed methodology implements a systematic validation process to identify and rectify fragmented lane markings within the preliminary lane set laneSet0, where adjacent segments belonging to the same physical lane may be incorrectly partitioned due to fitting sequence artifacts or continuity disruptions. As exemplified in Figure 6b by lane markings 12/13 and 8/9, such fragmentation occurs when geometrically continuous lane markings are erroneously segmented into discrete components during the initial fitting stage. To address this issue, a formal completeness criterion is established whereby fully continuous lane markings must span the entire vertical image dimension, with normalized vertical coordinates satisfying y0 = 0 and y1 = iH (the image height).
The detection of potentially fragmented lanes proceeds through selective screening of candidates exhibiting vertical discontinuity, specifically those with either y0 > 0.1·iH (incomplete upper termination) or y1 < 0.9·iH (incomplete lower termination). For each identified candidate, a comprehensive matching procedure evaluates all geometrically compatible lane segments in the remaining set that maintain non-overlapping vertical spans, employing a localized fitting approach to assess continuity. As demonstrated in Figure 6b for lanes 8 and 9, this involves applying the parametric model of the reference lane (8) to a strategically selected subset of points from the candidate segment (9), focusing on proximal regions where continuity is most probable.
The merging decision utilizes an adaptive error metric comparing the localized fitting residual against a dynamically determined threshold, accounting for the expected curvature variation in authentic lane markings.
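The localized fitting check can be sketched as follows; the number of proximal points (n_local = 20) and the rule for selecting them are assumptions, since the text only specifies that points near the probable continuity region are used.

```python
def merge_error(ref_params, cand_pts, n_local=20):
    """Apply the reference lane's model (Eq. (2)) to the candidate's proximal
    points and return the localized residual used for the merging decision."""
    c0, c1, c2 = ref_params
    # assumption: the candidate points nearest the gap are its lowest-y points
    cand_pts = cand_pts[np.argsort(cand_pts[:, 1])][:n_local]
    pred = c0 + c1 * cand_pts[:, 1] + c2 * cand_pts[:, 1] ** 2
    return np.mean(np.abs(cand_pts[:, 0] - pred))
```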
3.3.3. Boundary Lane Marking Refinement
Quadratic polynomial fitting generally provides satisfactory approximation accuracy for lane marking geometry in standard highway segments, serving as the primary mathematical model for initial lane fitting. However, this representation demonstrates limited capability in accurately capturing the complex curvature profiles exhibited at roadway bifurcations or along boundary lanes with pronounced arc geometries, resulting in significantly increased fitting residuals. Such limitations necessitate a comprehensive error evaluation protocol examining the approximation accuracy across all constituent segments of each lane marking.
The proposed framework implements a localized error assessment strategy whereby segments exhibiting fitting residuals exceeding established thresholds undergo secondary modeling using spline-based representations. As illustrated in Figure 6d, boundary lane marking 4 demonstrates superior geometric fidelity when modeled with spline functions compared to its parabolic approximation shown in Figure 6c, particularly in regions of high curvature variation. This adaptive approach maintains computational efficiency through selective application of higher-order models only where necessitated by local geometric complexity.
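Where the parabolic residual is too large, a cubic spline can be substituted, for example via SciPy; this is a sketch, and the smoothing factor is an assumption.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def refine_with_spline(skeleton, smooth=None):
    """Fallback cubic-spline model for boundary lanes where the parabola of
    Equation (2) fits poorly."""
    order = np.argsort(skeleton[:, 1])
    y, x = skeleton[order, 1], skeleton[order, 0]
    y, idx = np.unique(y, return_index=True)   # spline needs strictly increasing y
    return UnivariateSpline(y, x[idx], k=3, s=smooth)  # callable: x = f(y)
```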
3.4. Lane Marking Closure and Lane Division
3.4.1. Lane Marking Completion and Continuity Enforcement
In this critical phase of our methodology, we rigorously ensure the topological completeness of all detected lane markings through a multi-stage refinement process. The integrity of lane markings is paramount for accurate lane delineation, particularly in UAV-captured imagery where perspective distortion and segmentation artifacts frequently disrupt linear continuity.
- (1) Geometric Completeness Verification
We systematically examine each extracted lane marking for geometric continuity by evaluating its vertical span within the image coordinate system, classifying markings into three distinct categories based on their termination points. A lane marking is considered fully continuous when it spans the entire vertical dimension from the top (y0 = 0) to the bottom (y1 = image height) of the image; it is identified as head-truncated when the upper termination point (y0) is displaced from the image boundary; and it is categorized as tail-truncated when the lower termination point (y1) fails to reach the base of the image. Each classification informs our subsequent processing strategy for lane marking completion and refinement.
- (2) Adaptive Extension Algorithm
For markings with minor discontinuities where the vertical gap is less than 10% of the image height (δy < 0.1·iH), we employ polynomial extrapolation based on the established curve parameters. As shown in Figure 6d, lane marking 5 presents a representative case where the bottom segment does not fully extend to the image boundary, yet the missing portion remains sufficiently limited (δy < 0.1·iH). In such scenarios, we seamlessly extend the lane marking to the image bottom by applying its original polynomial parameters (c0, c1, c2 from Equation (2)), thereby maintaining the inherent curvature characteristics without introducing additional fitting errors or distortions.
This methodology demonstrates particular effectiveness in addressing several common challenges encountered in aerial lane detection. It reliably handles short gaps resulting from temporary occlusions, compensates for local segmentation failures when flanking regions exhibit consistency, and corrects perspective-induced thinning effects at image boundaries where the overall trajectory remains stable. The approach maintains robust performance across varying image resolutions and flight altitudes.
The extrapolation process ensures geometric coherence through multiple safeguards. It strictly preserves the original polynomial coefficients during extension, cross-validates each extension against parallel lane constraints to maintain road geometry, and implements an extension length limit of 10% of the image height (δy_max = 0.1·iH) to prevent artifacts from over-extension.
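A sketch of the extension step for a tail-truncated marking follows (the head-truncated case is symmetric); the point representation and names are illustrative.

```python
def extend_to_bottom(marking_pts, params, iH, max_ratio=0.1):
    """Extrapolate a tail-truncated marking to the image bottom using its own
    polynomial coefficients, capped at 10% of the image height."""
    c0, c1, c2 = params
    y1 = marking_pts[:, 1].max()
    if 0 < iH - y1 < max_ratio * iH:                # small gap only
        y_new = np.arange(y1 + 1, iH)
        x_new = c0 + c1 * y_new + c2 * y_new ** 2   # original curve, no refit
        marking_pts = np.vstack([marking_pts, np.column_stack([x_new, y_new])])
    return marking_pts
```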
- (3) Lane Marking Merging
For lane markings that exhibit bifurcation, such as lane marking 8 in Figure 6d, where the bottom portion is significantly incomplete with substantial missing segments, it becomes necessary to establish connections with other lane markings that maintain complete bottom continuity. The process for identifying suitable connecting lane markings follows the same methodology as the lane merging approach described in Section 3.3.2, with the key distinction being the absence of a threshold constraint in this case. Instead, the optimal connection is determined by selecting the lane marking with the minimal fitting error. As illustrated in Figure 6d, lane marking 8 requires connection to lane marking 5.
The connection between lane markings 8 and 5 is established by directly linking point A (the bottommost point of lane marking 8) to point B on lane marking 5. While point A can be readily identified as the terminal point of lane marking 8, the determination of point B involves two critical factors: first, whether the candidate point lies along the projected extension direction of lane marking 8, and second, the spatial distance between points A and B. The directional alignment can be evaluated through the fitting error of lane marking 8, while the spatial proximity can be assessed by measuring the vertical (Y-axis) separation distance. In practical implementation, we employ a weighted formula to define a deviation metric:

λi = w1·ηi + w2·φi        (5)

where ηi and φi denote the normalized distance and the fitting error, respectively, the subscript i denotes the ith point on the candidate lane marking, w1 and w2 are weighting coefficients, and λi represents the deviation index relative to lane marking 8, with higher values indicating greater deviation. The point with the minimal λi is identified as the optimal connection point B.
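A sketch of the connection-point search for the example of markings 8 and 5; the equal weights w1 = w2 = 0.5 and the max-based normalization are assumptions, since the text does not specify them.

```python
def best_connection_point(lane5_pts, pointA, params8, w1=0.5, w2=0.5):
    """Pick point B on the complete marking by minimizing the weighted
    deviation of Equation (5)."""
    dy = np.abs(lane5_pts[:, 1] - pointA[1])
    eta = dy / (dy.max() + 1e-9)                   # normalized Y-distance
    c0, c1, c2 = params8                           # reference lane model (Eq. (2))
    phi = np.abs(lane5_pts[:, 0]
                 - (c0 + c1 * lane5_pts[:, 1] + c2 * lane5_pts[:, 1] ** 2))
    phi = phi / (phi.max() + 1e-9)                 # normalized fitting error
    lam = w1 * eta + w2 * phi                      # deviation index lambda_i
    return lane5_pts[np.argmin(lam)]               # optimal connection point B
```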
3.4.2. Lane Division Based on Global Road Information
Following completion of the lane marking refinement procedure, the final lane partitioning is established through a systematic spatial organization of all validated markings. Precise positional ordering is implemented by first sorting markings according to ascending starting X-coordinates in a left-to-right sequence, with secondary sorting based on ending X-coordinates to resolve branching pattern ambiguities. This deterministic arrangement ensures that the planar region delimited by any two adjacent markings in the ordered sequence constitutes a distinct navigable lane, thereby maintaining complete coverage of drivable surfaces while preserving roadway network topology.
Specialized processing addresses two critical edge cases during the division process. Emergency lane identification combines road segmentation masks with lane width consistency metrics to compensate for absent boundary markings, while non-navigable zones including medians and vegetated areas are excluded through statistical analysis of road pixel density within candidate regions. This dual validation methodology, integrating geometric ordering with semantic verification, demonstrates consistent performance across diverse highway configurations and imaging conditions while adhering to established roadway design standards.
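The core ordering rule can be expressed compactly; each marking is assumed here to be an (N × 2) array of (x, y) points sorted by y, so that m[0] and m[-1] are its start and end points.

```python
def divide_lanes(markings):
    """Order markings left-to-right and pair adjacent ones into lane regions."""
    # primary key: starting X-coordinate; secondary key: ending X (branches)
    ordered = sorted(markings, key=lambda m: (m[0][0], m[-1][0]))
    # each adjacent pair of markings delimits one candidate lane region,
    # subject to the emergency-lane and non-navigable-zone checks above
    return list(zip(ordered, ordered[1:]))
```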
4. Experimental Analysis
Our research addresses the critical task of highway lane extraction from UAV imagery through a three-stage computational framework comprising road segmentation, lane marking extraction, and lane division. The semantic segmentation component proves particularly crucial, as its performance directly determines the accuracy of subsequent processing stages. Consequently, our evaluation methodology encompasses both the intermediate semantic segmentation results and final lane extraction outputs.
To optimize the semantic segmentation performance, we implement an enhanced STDC network architecture incorporating dual loss optimization. The primary segmentation loss employs Online Hard Example Mining (OHEM) to prioritize challenging samples during batch training, while the complementary boundary loss utilizes Laplacian-based edge detection to generate precise boundary labels. This combined approach simultaneously optimizes both region coherence and boundary localization through integrated binary cross-entropy and Dice loss minimization.
We adopt mini-batch Stochastic Gradient Descent (SGD) for parameter optimization, with momentum and weight decay set to 0.9 and 5 × 10−5, respectively. The learning rate, initialized at 0.01, follows a warmup strategy for the first 1000 iterations. Subsequently, the learning rate decays polynomially: lr = initial learning rate × (1 − iter/max_iter)^power, where power = 0.9.
Figure 7 illustrates the learning rate and loss curves throughout training. After 43,000 iterations, the loss function stabilized with a batch size of 8, indicating successful model training. The hyperparameters associated with the learning rate and optimizer were primarily based on references [34,35]. A batch size of 8 was selected after empirical evaluation of multiple options, including 4, 8, and 16.
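The schedule can be expressed compactly; the linear form of the warmup is an assumption, as the text only states that a warmup is used.

```python
def poly_lr(base_lr, it, max_iter, warmup_iter=1000, power=0.9):
    """Warmup for the first 1000 iterations, then polynomial decay."""
    if it < warmup_iter:
        return base_lr * it / warmup_iter           # assumed linear warmup
    return base_lr * (1 - it / max_iter) ** power   # poly decay, power = 0.9
```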
To enhance model generalization, we implement multiple data augmentation techniques, including color jittering [36], random horizontal flipping [36], random cropping [37], and random resizing [38,39]. The original image size is 1920 × 1280. Higher resolution generally results in longer training and inference times, yet it also leads to improved segmentation accuracy. To strike a balance between accuracy and computational cost while preserving the original aspect ratio, we resized the images to 1080 × 720. The training and evaluation environment utilizes the PyTorch 1.9.1 framework with CUDA 11.1 enabled for accelerated computing. The system is equipped with an Intel(R) Xeon(R) Gold 5218R CPU @ 2.10 GHz and an NVIDIA GeForce RTX 3090 GPU.
For semantic segmentation evaluation, we employ two standard metrics: Intersection over Union (IoU) and Accuracy (Acc). IoU measures the overlap between predicted and ground truth regions for each category, while Acc calculates pixel-wise classification accuracy. The IoU is mathematically defined as:

IoU = |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)

where X is the pixel set of a specific category in the ground truth, Y is the corresponding pixel set in the prediction results, |X ∩ Y| denotes the intersection of X and Y, and |X| and |Y| represent the cardinalities of sets X and Y, respectively. As shown in Table 2, the STDC model outperforms the other two compared models, achieving an IoU of 90.97% in road segmentation and 74.03% in lane marking segmentation, along with an overall frequency-weighted IoU (FwIoU) of 96.53% and a pixel accuracy of 98.17%.
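Computed over integer label maps, the per-class IoU is a few lines of NumPy (a sketch):

```python
def class_iou(pred, gt, cls):
    """Per-class IoU over pixel sets: |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)."""
    x, y = (gt == cls), (pred == cls)
    inter = np.logical_and(x, y).sum()
    return inter / (x.sum() + y.sum() - inter + 1e-9)
```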
Distinct from conventional semantic segmentation approaches for road and lane marking identification, our lane extraction framework addresses an instance segmentation challenge demanding accurate differentiation between individual lane markings. Accordingly, we employ recall (sensitivity), precision, and F1-score as our primary evaluation metrics. Following established practices in object detection, a lane prediction is considered successful when the Intersection over Union (IoU) between the predicted lane mask and ground truth mask exceeds a predetermined threshold, with the constraint that each ground truth lane can be matched to at most one predicted lane.
Table 3 presents the lane prediction performance across varying IoU thresholds. Notably, at an IoU threshold of 0.8, our method achieves comprehensive performance with recall exceeding 90%, precision above 90%, and F1-score maintaining over 90%. The formal definitions of these metrics are:

Recall = TP/(TP + FN), Precision = TP/(TP + FP), F1 = 2 × Precision × Recall/(Precision + Recall)

where TP (True Positive) represents correctly identified lane markings, FP (False Positive) indicates erroneously detected non-lane regions, FN (False Negative) denotes missed actual lane markings, and TN (True Negative) refers to correctly rejected background areas.
At IoU = 0.5, our method achieves 93.25% recall, 94.98% precision, and a 94.11% F1-score, with mean IoU reaching 93.84%. These results demonstrate robust performance across various matching criteria.
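For reference, a sketch of the instance-level evaluation under the one-to-one matching constraint described above; the greedy matching order is an assumption.

```python
def lane_prf(pred_masks, gt_masks, iou_thresh=0.8):
    """Greedy one-to-one matching: a prediction counts as TP when its IoU with
    an as-yet-unmatched ground-truth lane exceeds the threshold."""
    matched, tp = set(), 0
    for p in pred_masks:                       # boolean per-lane masks
        for i, g in enumerate(gt_masks):
            if i in matched:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            if union and inter / union > iou_thresh:
                matched.add(i)
                tp += 1
                break
    fp, fn = len(pred_masks) - tp, len(gt_masks) - tp
    recall = tp / (tp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return recall, precision, f1
```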
Figure 8 showcases qualitative results on two typical UAV images. The right example demonstrates accurate lane extraction from low-altitude imagery where lane markings appear clearly. The left example shows our method’s capability in handling more challenging high-altitude scenarios where lane markings become smaller and less distinct, while still maintaining reliable performance. These visual results complement the quantitative metrics, comprehensively validating our approach’s effectiveness in practical applications.