An Onboard Integrated Perception and Control Framework for Autonomous Quadrotor UAV Perching on Markerless Hurdles

Kim, Donghyun; Chang, Dong Eui

doi:10.3390/drones10040270


by Donghyun Kim and Dong Eui Chang *

School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea

* Author to whom correspondence should be addressed.

Drones 2026, 10(4), 270; https://doi.org/10.3390/drones10040270

Submission received: 24 February 2026 / Revised: 28 March 2026 / Accepted: 4 April 2026 / Published: 8 April 2026


Highlights

What are the main findings?

We propose an onboard, integrated perception–control perching system that detects a 3D hurdle structure at long range and autonomously perches a quadrotor on the hurdle’s markerless horizontal bar. The full pipeline is computationally lightweight and runs in real time on an onboard Jetson Orin NX.
We enhance a YOLOv12n-based hurdle perception model by adding a high-resolution P2 pathway and applying lightweight module replacements. A state-machine controller enables robust perching from long-range small targets to unstable close-range partial views.

What are the implications of the main findings?

The proposed model improves small-object detection performance and efficiency over baselines on VisDrone and a custom hurdle dataset, and outdoor flight experiments demonstrate successful perching in all six tests.
By enabling markerless perching without a gripper or fiducial markers, using only a single-axis gimbal and fixed camera, 1D altitude LiDAR, and a short-range ToF sensor, the system provides practical design guidelines for UAV perching with minimal hardware augmentation, supporting long-endurance unmanned missions where energy savings are critical.

Abstract

This paper presents an onboard, markerless perching system for a quadrotor UAV, validated in outdoor flight experiments, to reduce hovering energy during long-endurance unmanned missions. Existing autonomous landing research predominantly focuses on planar surfaces, cooperative environments with visual markers, or specialized hardware, limiting scalability to scenarios requiring detection and perching on thin rod-like targets in uncooperative outdoor settings. This study proposes a markerless perching system for autonomously perching a drone on a hurdle’s horizontal bar. The system employs a single-axis gimbal camera, altitude LiDAR, and ToF sensor, integrating perception, post-processing, and control. On the perception side, we augment a YOLOv12n-based segmentation model with a high-resolution P2 pathway for small-object detection and apply module compression for real-time inference on edge devices. Robustness is improved by jointly utilizing the full hurdle and horizontal bar while constructing negative samples to suppress false positives. On the control side, a state machine controller leverages centroid coordinates, orientation, and distance measurements to achieve a stable long-range approach and precise close-range alignment. Experiments on a Jetson Orin NX-based system demonstrate successful perching in all six outdoor flight tests. Ablation studies quantitatively analyze each component’s contribution to perching success rate and completion time. This research validates perching technology’s practical applicability through outdoor markerless perching on thin 3D structures.

Keywords:

drone perching; small-object detection; markerless; visual servoing

1. Introduction

Recent advances in unmanned aerial vehicle (UAV) technology have significantly expanded the application scope of drones. However, limited battery capacity remains a major constraint for long-duration missions. To address this limitation, autonomous perching, which reduces energy consumption by letting the UAV rest on a structure instead of hovering, has gained attention. In scenarios that require extended flight time, such as power line inspection and long-term wildfire monitoring, perching is a key technology that can greatly improve operational efficiency [1,2].

Unlike conventional ground landing, perching requires the UAV to detect thin, spatially limited structures such as poles and bars from a distance, approach them precisely, and settle on them stably. Outdoor environments make this problem substantially harder. They do not provide artificial markers, and they combine illumination changes, motion blur, occlusion, background clutter such as pipes and branches, and wind disturbances. These factors increase uncertainty in both perception and control, making practical outdoor perching difficult [3,4]. As a result, an outdoor perching system must simultaneously provide (i) robust structure detection in uncooperative environments, (ii) stable guidance and descent under perception uncertainty, and (iii) real-time operation under limited onboard computational resources.

This study proposes an autonomous perching system that allows a drone to land on the horizontal bar of a hurdle structure. The system integrates the full perching pipeline within a single framework. It detects and tracks the target using onboard vision and avoids separate guidance fixtures or mechanical locking devices beyond a single-axis gimbal and a distance sensor. The final objective is to place the hurdle bar between the four landing gears so that stable perching can be achieved through attitude control alone, without physical coupling.

This objective leads to two tightly coupled challenges. First, the actual perching target is the horizontal bar, but at long range, the bar is difficult to distinguish because of its simple geometry and its visual similarity to background structures such as pipes or branches. To improve robustness across distances, we detect both the full hurdle and the horizontal bar and use both outputs in control. This perception strategy is supported by hard-negative sampling, P2 feature utilization, and selective module lightweighting so that the detector can improve long-range small-object recognition, preserve close-range precision, and still satisfy real-time onboard constraints. Second, the recognition outputs must be transferred to the controller without delay or instability. We therefore integrate a hurdle-specific detection model with a PID-based control loop, where the model is trained on approach video frames captured by the onboard camera, and the controller uses the estimated hurdle position and orientation to execute the perching maneuver.

Based on these problem settings, the proposed approach adopts several design decisions that reflect both perching requirements and platform constraints. We build on the YOLO family because it offers a strong speed–accuracy trade-off and a mature deployment ecosystem, and we select YOLOv12n as a lightweight baseline that can be adapted for this task through a P2 path and selective module simplification. Rather than detecting only the crossbar, the perception module jointly recognizes the hurdle frame as the whole object and the crossbar as its part. This whole-part formulation reduces scale-dependent ambiguity and helps suppress false positives from visually similar bar-like structures, enabling a more reliable transition from long-range approach to near-range alignment.

On the control side, we implement a modular state-machine architecture that separates the theoretically grounded reaching phases from the empirically tuned perching phases. This structure simplifies implementation and tuning while preserving explicit failure-handling logic for outdoor operation, where available sensor information changes with distance and flight condition. The hardware configuration is also deliberately minimal: the system relies on standard UAV components, including stereo cameras, a gimbal system, and a 1D LiDAR, and adds only a single ToF sensor. As a result, the platform provides the sensing functions required for perching while keeping hardware dependency low and maintaining portability to existing drone systems.

The contributions of this paper are as follows:

General-purpose perching system with low hardware dependency: We implement perching on a standard F450 frame without adding special mechanical structures. The system uses only one extra ToF sensor in addition to a gimbal camera and LiDAR for altitude estimation.
Lightweight model for small-object recognition: Based on YOLOv12n [5], we design a model for small-object detection with a computational cost of about 10 GFLOPs. The model adds a P2 path and applies selective module lightweighting.
Real-environment dataset construction: We construct a hurdle dataset that includes a range of distances, viewing angles, lighting conditions, and hurdle-like distractors, and we use it for training.
Integrated perception and control on embedded hardware: We implement the full autonomous perching system so that detection and control operate in real time on a Jetson Orin NX board.
Real-flight verification and ablation study: We evaluate perching performance through outdoor flight experiments and use ablation studies to identify the contribution of each major control component.

The remainder of this paper is organized as follows. Section 2 reviews related work and clarifies the gaps addressed in this study. Section 3 then presents the hurdle recognition model, and Section 4 describes the autonomous perching algorithm that uses the recognition outputs. Section 5 reports the experimental results, including both quantitative performance and representative failure cases. Finally, Section 6 summarizes the main findings and discusses limitations.

2. Background

2.1. Autonomous Perching and Landing Systems and Hardware

Autonomous perching is an important capability for reducing energy consumption during missions and for enabling long-term observation at fixed locations. Existing research can be broadly divided into two streams: (i) hardware mechanisms that physically attach the UAV to a structure, and (ii) perception-control algorithms that guide the UAV to the target. In the first stream, researchers have proposed dedicated hardware such as grippers, needles, spines, and adhesive mechanisms [1,6,7]. Representative examples include microspine-based wall attachment for fixed-wing UAVs [8], hugging-wing methods that land on vertical poles after collision using wing deformation [6], and quadcopters that attach to walls or ceilings through adhesion and ceiling effects [7]. However, these studies focus mainly on stability after attachment. They address detection and autonomous approach in uncooperative environments much less directly.

The second stream includes vision-based autonomous landing on both static targets, such as roads and platforms, and dynamic targets, such as vehicles and ships. The level of difficulty depends strongly on whether the environment is cooperative, with markers such as “T” or “H” patterns, or uncooperative, where markers cannot be installed easily [2]. For example, Prachand et al. [9] used Fractal ArUco markers to land on ship decks moving with waves and determined the landing timing from marker-based 3D motion estimation. Nguyen et al. [10] proposed a safe landing strategy that detects and avoids people using panoramic sensing and YOLO-based human recognition. Other studies have addressed runway detection with vision/IMU fusion [11] and YOLO-based control for dynamic moving platforms [12]. Nevertheless, these approaches are generally designed for planar landing on the ground, decks, or pads. Many also assume cooperative conditions. Their direct extension to markerless perching on thin bars, therefore, remains limited.

2.2. Visual Servoing-Based Control

Visual servoing is a technique that uses visual information from cameras as feedback in a control loop to guide a robot to a target state. While the previous subsection focused on what UAVs land on and what hardware they use, this subsection focuses on how visual observations are converted into motion commands. Attempts to apply vision-based control to real systems have been reported continuously in the UAV field [13,14,15]. He et al. [14] implemented and demonstrated image-based visual servo control on aerial platforms, and Chen et al. [15] showed that UAVs can track moving targets in real time through visual servoing. Levine et al. [13] presented an end-to-end deep visuomotor policy that generates control commands directly from camera images, establishing a learnable nonlinear mapping between vision and control. These studies suggest ways to connect visual feedback directly to controllers. However, few works robustly integrate detection and control in a single system under the conditions that occur simultaneously in real outdoor flight: motion blur, occlusion, illumination changes, and, in particular, small-object detection uncertainty.

2.3. Small Object Detection and Lightweight Model Research

Applying deep learning-based recognition on platforms with limited onboard computational resources, such as drones, requires lightweight models and real-time performance. In practice, however, control strategies are useful only if the target remains detectable throughout the maneuver. In aerial imagery in particular, targets occupy small image areas and vary widely in scale, so small-object detection performance is crucial to overall system performance, including tracking stability and control-input reliability. Accordingly, various lightweight models and structural improvements have been proposed to enhance small-object detection while reducing computational load [16,17,18,19,20].

A survey of small-object detection [18] summarizes the difficulties of aerial imagery as follows: (i) low resolution and motion blur caused by long-range shooting, compression, and vibration; (ii) the few features that small objects provide and their blending with the background; (iii) loss of detail in downsampling-centric CNN structures; (iv) more missed detections under occlusion; and (v) evaluation that is unfavorable in terms of IoU when the feature-map resolution is coarser than the object size. To mitigate these problems, existing studies have generally proposed improvements such as (a) multi-scale feature fusion (FPN/PAN extensions), (b) use of high-resolution (P2) features, (c) dynamic or adaptive detection heads, and (d) attention-based context enhancement.

VisDrone is widely used as a representative benchmark for aerial image-based object detection [21]. Aerial imagery contains many long-range small objects because of high-altitude shooting, object appearances are simplified by top-down viewpoints, and the images mix arbitrary 3D rotations with background noise from ground-surface textures and structures. The VisDrone-DET2019 challenge report [3] points out that many existing models degrade under these conditions and emphasizes the need for UAV domain-specific designs.

Recent studies have increasingly focused on simultaneously achieving real-time performance and accuracy in embedded environments by combining lightweight backbone/neck designs with high-resolution feature (P2) paths favorable for small objects. For example, SOD-YOLO is a YOLOv7-based lightweight detector that replaces the backbone with a structure composed of DSD and LFI modules and includes a P2 head for small object detection, improving accuracy while reducing latency on VisDrone [17]. Small Target-YOLOv5 introduced P2 heads and BiFPN to YOLOv5 and combined CARFB and DyHead to enhance receptive fields, improving small object performance [19]. As a transformer-based approach, Drone-DETR improved small object detection performance on VisDrone through a backbone including hybrid downsampling (MPD) and PConv-based blocks for detail preservation, along with P2 feature propagation and multi-scale attention [4].

Meanwhile, research on module combinations that preserve expressiveness while maintaining computational efficiency is also active. MFP-YOLO introduced MIRB and C3_CBAM, parallel involution-based PISPPF, Focal-EIoU loss, and decoupled heads based on YOLOv5s, improving performance while reducing computational load and model size [20]. ASG-YOLOv5 combined context integration through DCA with low-level feature enhancement and interference suppression through SGM, and applied a regression loss combining NWD with CIoU, improving both small-object accuracy and real-time performance on VisDrone and AI-TOD [22]. EUAVDet is a YOLOv8-based lightweight model that simultaneously improved accuracy and latency on VisDrone through downsampling path design (PWConv-2x2, MaxPool-PWConv), Multi-Kernel Aggregation, Faster Ghost Module, and Focused FPN [16].

The utilization of part-whole relationships has been extensively studied as an approach to improve not only recognition performance in complex environments but also stepwise control stability. Particularly in situations where only parts of objects become identifiable due to observation conditions or distance changes, strategies that distinguish and utilize whole structures and partial structures are known to be effective. As a representative example of utilizing part-whole relationships in the recognition stage, Chen et al. modeled objects as a graph structure composed of one whole node and multiple part nodes, and introduced switch variables that control part-by-part activation. Through this, they reported improved recognition performance under deformation, occlusion, and low-resolution conditions by selectively excluding parts with incomplete or occluded information and combining available part information [23]. These strategies are also utilized in similar forms in stepwise control problems. For example, Patruno et al. configured landing pads to include three stages of features by approach phase, first detecting targets through large structural features that are clearly visible in images at long distances, and implementing a landing that precisely estimates pose using internal detailed features at close distances [24]. Additionally, Lin et al. distinguished approach and alignment phases for UAV landing in outdoor low-light environments, showing that visual precision requirements differ by phase and experimentally validating the effectiveness of hierarchical recognition and control strategies [25].

2.4. Sensor Fusion Research for Autonomous Flight

For autonomous UAV flight in GNSS-limited or low-reliability environments, camera–IMU-based VIO/VINS has been studied extensively. Reliable perching, however, depends not only on estimation quality but also on how sensor availability changes with distance and environment. The observability and reliability of each sensor vary with lighting, geometry, and surface properties. Existing studies show that camera-based VIO/VSLAM can degrade under illumination changes and in low-texture environments [26]. This limitation motivates the use of LiDAR as an active sensor that is less sensitive to lighting changes [26]. LiDAR is not immune to failure, however. Its performance can also degrade under strong direct sunlight because of interference from highly reflective targets or backscattering [26].

Hao et al. [27] addressed scale drift in downward-looking camera–IMU VIO during long-term operation by combining a 1-axis range finder (1D LRF) with RVIO. Irfan et al. [28] integrated stereo cameras, an IMU, 3D LiDAR, and GPS, and showed that sensor reliability can change rapidly with illumination, texture scarcity, and GPS degradation or loss. They proposed an LSTM-based framework that adaptively adjusts sensor-fusion weights in real time. Abdelaziz et al. [26] reported an integrated navigation structure that first combines INS and LiDAR with an EKF and then progressively integrates stereo SLAM. Their approach reduces position RMSE in GNSS-denied environments and mitigates the cumulative drift of INS-only navigation.

Furthermore, the validity of stereo depth estimates changes across distance ranges. Continuous distance observation is important for landing and proximity maneuvers, but stereo depth can become unstable at close range because of reduced field-of-view overlap or minimum measurement-distance constraints [29]. Other depth-camera studies report the same limitation: depth is often unavailable in the nearest range because of the sensor’s minimum operating distance, which is typically on the order of several tens of centimeters [30]. Single-axis LiDAR also has limitations when it is used for altitude observation. Petrlík et al. [31] proposed a stabilization and estimation structure to reduce abrupt altitude-estimation changes when UAVs fly over obstacles. Pritzl et al. [32] showed that downward range measurements can degrade or be delayed because of ground-elevation changes or interference from the UAV undercarriage, and they proposed an altitude-estimation method that accounts for these delays and time-varying uncertainties. Overall, cameras, LiDAR, rangefinders, and ToF sensors are complementary, but none is uniformly reliable. In close-proximity flight around thin structures, the sensing strategy must therefore account for illumination, reflection, occlusion, and distance-dependent validity.

2.5. Research Motivation and Problem Definition

The literature reviewed above reveals several gaps in autonomous landing and perching research. Lightweight aerial detectors have mostly been developed on general-purpose benchmarks such as COCO and VisDrone. Many autonomous landing studies, in contrast, focus on planar landing, marker-based guidance, or cooperative environments. Markerless perching on thin 3D structures in outdoor environments poses a different problem. It requires reliable long-range observation, precise close-range alignment, and a stable transition between these phases. All of these requirements must be handled within a single perception–control loop. To the best of our knowledge, no prior study has addressed this integrated design problem in exactly this form. Simply combining existing technologies is not sufficient for an outdoor hurdle perching system, because approaches developed for planar landing or marker-based cooperative environments do not extend directly to this setting. Sensor availability also changes with distance. Cameras provide visual information such as target center and tilt, but stereo depth may become unreliable at close range because of minimum-distance constraints. A 1D LiDAR is useful for altitude observation, but structural occlusion can create abrupt measurement errors. A ToF sensor is useful for final distance verification immediately before landing, but its effective range is short. The central challenge is therefore to connect long-range approach, close-range alignment, and final landing while respecting these sensor-specific limitations.

These requirements also guide the perception-model design. FPN/PAN necks fuse backbone outputs at multiple resolutions, such as C2–C5, through upsampling and same-resolution aggregation. This allows each pyramid level to retain both semantic information and location detail. Small objects, however, may occupy only a few cells in the P3 grid, which weakens their features. For aerial imagery and perching scenarios, where long-range targets are often very small, adding a P2 level is therefore advantageous because it preserves higher-resolution C2 features. In addition, lightweight models are still needed to achieve both real-time operation and sufficient detection performance under onboard computational constraints. YOLOv12 achieves strong performance by repeatedly using A2C2f-based attention blocks in the backbone and neck. These blocks are heavier than earlier C3k2 blocks. Because our task involves only a few classes and relatively low object complexity, replacing selected blocks with lighter modules is a reasonable way to reduce computation while limiting performance loss.
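As a rough illustration of why the P3 grid is coarse for distant targets, the sketch below counts how many feature-map cells a thin bar covers at each pyramid level. It assumes the usual YOLO strides (4, 8, 16, and 32 input pixels per cell for P2 to P5) and a hypothetical far-range bar footprint of 40 × 4 pixels. The footprint and the helper names are illustrative and are not taken from the paper.

```python
import math
from typing import Tuple

# Assumed strides (input pixels per feature-map cell) for the usual YOLO pyramid levels.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_size(image_w: int, image_h: int, stride: int) -> Tuple[int, int]:
    """Feature-map size (columns, rows) for a given input size and stride."""
    return image_w // stride, image_h // stride


def cells_spanned(width_px: int, height_px: int, stride: int) -> Tuple[int, int]:
    """Cells covered by a box whose top-left corner lies on a cell boundary."""
    return math.ceil(width_px / stride), math.ceil(height_px / stride)


if __name__ == "__main__":
    image_w, image_h = 640, 480   # 640 x 480 input, as used by the inference code
    bar_w, bar_h = 40, 4          # hypothetical far-range bar footprint in pixels
    for level, stride in STRIDES.items():
        cols, rows = grid_size(image_w, image_h, stride)
        span = cells_spanned(bar_w, bar_h, stride)
        print(f"{level}: grid {cols}x{rows}, bar spans {span[0]}x{span[1]} cells")
```

Under these assumptions the bar covers 10 cells along its length at P2, 5 at P3, and only 2 at P5, which is the resolution gap that motivates the extra P2 level.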

Motivated by these observations, we implement an integrated perception-control framework for markerless perching. The framework recognizes hurdles as whole structures and horizontal bars as parts so that the controller can use information appropriate to each distance range. It also combines detection post-processing with a state-machine controller that explicitly uses both whole and partial observations. The result is a unified markerless perching system that supports a stable long-range approach, precise close-range alignment, and real-time operation on an onboard embedded platform. With this motivation established, the next section focuses on the perception model that forms the front end of the full system.

3. Hurdle Recognition Model

The hurdle recognition model proposed in this study is based on YOLOv12n, a lightweight detector with a computational cost of approximately 10 GFLOPs [5]. We selected YOLOv12n because the YOLO family provides a strong speed–accuracy trade-off and is widely used in aerial imaging applications. Among recent YOLO variants, YOLOv12n offers a particularly favorable performance-to-size ratio on standard benchmarks such as COCO, which makes it suitable for resource-constrained onboard deployment.

At first glance, it may seem sufficient to detect only the hurdle’s horizontal bar, since that bar is the final perching target. In outdoor environments, however, many structures, such as pipes and tree branches, resemble horizontal bars. The bar itself also occupies only a small image region, especially at long range. For this reason, we train the model to detect both the whole hurdle and the horizontal bar. We then design the control pipeline to use both outputs. The standard YOLOv12n architecture is also limited in its ability to preserve fine features for distant hurdles. To address these issues, we modify the model architecture and construct a custom hurdle dataset. The final model is converted to a TensorRT engine and optimized for real-time inference above 10 FPS on the Jetson Orin NX edge board.

Figure 1 shows the detailed structure of the proposed hurdle recognition model. The parts modified relative to YOLOv12n are highlighted in red. The most significant structural modification is the introduction of the P2 feature level to improve small-object detection performance. The existing YOLOv12n FPN-PAN structure uses P3–P5 feature maps, which lack sufficient resolution to capture fine structures such as thin horizontal bars or vertical poles at long distances. To compensate, we add a branch in the neck that upsamples the P3 feature map, combines it with the P2 features from the backbone, and then downsamples the result again before passing it to the upper layers. The P2 feature map maintains high resolution and contributes critically to recognizing small objects at long distances.
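The data flow of the added P2 branch can be sketched at the level of tensor shapes. The following is a minimal NumPy illustration, not the authors' implementation: the learned convolution blocks are omitted, nearest-neighbour upsampling and stride-2 subsampling stand in for the real layers, and the channel counts (32 and 64) are hypothetical.

```python
import numpy as np


def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def downsample2x(x: np.ndarray) -> np.ndarray:
    """Stride-2 subsampling, standing in for a learned stride-2 convolution."""
    return x[:, ::2, ::2]


def p2_branch(p2_backbone: np.ndarray, p3_neck: np.ndarray):
    """Upsample P3, fuse it with backbone P2, then return to P3 resolution."""
    up = upsample2x(p3_neck)                                # P3 -> P2 resolution
    fused_p2 = np.concatenate([up, p2_backbone], axis=0)    # channel-wise concat
    back_to_p3 = downsample2x(fused_p2)                     # passed on to upper layers
    return fused_p2, back_to_p3


if __name__ == "__main__":
    # 640 x 480 input: P2 is 120 x 160 (stride 4), P3 is 60 x 80 (stride 8).
    p2 = np.zeros((32, 120, 160), dtype=np.float32)  # hypothetical channel count
    p3 = np.zeros((64, 60, 80), dtype=np.float32)    # hypothetical channel count
    fused, back = p2_branch(p2, p3)
    print(fused.shape, back.shape)
```

The fused P2 map keeps the stride-4 resolution for a small-object head, while the downsampled copy rejoins the existing P3 path. This also shows where the extra memory cost arises, since the P2 map has four times as many cells as P3.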

However, the addition of the P2 feature level increases computational load and memory usage. To offset this and secure real-time performance on edge devices, this study replaces some A2C2f modules that do not significantly affect performance with lighter C3k2 modules, and selectively applies GhostConv modules [33,34] that can reduce the number of parameters while maintaining feature extraction capability. These module replacements and structural changes are determined through iterative training speed experiments using the COCO dataset.
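To give a sense of the savings from GhostConv, the sketch below compares weight counts for a dense convolution and a Ghost-style replacement. It assumes the common formulation with ratio 2, in which a dense convolution produces half of the output channels and a 5 × 5 depthwise convolution generates the other half. The layer sizes are hypothetical, bias and normalization parameters are ignored, and the exact configuration used in the paper may differ.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a dense k x k convolution (bias and normalisation ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in: int, c_out: int, k: int, cheap_k: int = 5) -> int:
    """Weights of a ratio-2 Ghost module (assumed formulation).

    A dense k x k convolution produces c_out / 2 channels, and a depthwise
    cheap_k x cheap_k convolution derives the remaining c_out / 2 from them.
    """
    half = c_out // 2
    primary = c_in * half * k * k
    cheap = half * cheap_k * cheap_k   # depthwise: one filter per channel
    return primary + cheap


if __name__ == "__main__":
    c_in, c_out, k = 64, 128, 3        # hypothetical layer size
    dense = conv_params(c_in, c_out, k)
    ghost = ghost_conv_params(c_in, c_out, k)
    print(f"dense: {dense}, ghost: {ghost}, ratio: {ghost / dense:.3f}")
```

For this hypothetical layer the Ghost variant needs 38,464 weights instead of 73,728, or roughly half, which is the kind of margin that can help offset the cost of the added P2 path.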

The hurdle dataset constructed for training and validation covers four types of hurdles with identical specifications but different colors, and it uses video frames captured under various weather, lighting, distance, and angle conditions. The dataset has two main characteristics. First, it reflects the constraints of actual flight environments: it includes many motion-blurred and long-distance low-resolution images of the kind that occur during flight, so that the model operates robustly even with unclear inputs. Second, it uses negative samples to suppress false positives. The training data deliberately include two types of negative samples: (i) background images that contain no hurdles, and (ii) images containing confusing structures with hurdle-like shapes (horizontal bars, gazebos, pipes, tree branches, drone landing gear, disassembled hurdle parts, etc.). In particular, 505 training images consist solely of such negative samples, which pushes the model to distinguish clearly between background noise and actual hurdles.

In total, the hurdle dataset consists of 1964 training images extracted from 14 videos filmed with 3 types of hurdles and 99 validation images extracted from 1 video filmed with 1 type of hurdle. The test dataset uses 237 images extracted from 4 videos containing 4 types of hurdles, including 1 type that is not used in training. Figure 2 shows the physical specifications of the hurdle used in the experiments. The hurdle is built from 50 A PVC pipes and consists of a 1 m long horizontal bar, a 0.7 m long vertical pole, and four 0.5 m pipes that support the structure on the ground. Four hurdles painted in different colors are used to construct the training and evaluation datasets.

The model’s inference results are converted by a post-processing algorithm into the state information required for drone control (center position, inclination, and distance). This post-processing is presented in Algorithm 1. The inference code receives images with a 640 × 480 resolution and generates segmentation masks for hurdles and horizontal bars. At long distances, where the horizontal bar is not identified, the geometric center of the entire mask recognized as a hurdle is set as the target point to guide the approach. In this regime, the model sometimes misses the hurdle and misrecognizes background noise instead. To address this, when the area recognized as a hurdle in the current frame is much smaller than in previous frames, the detection is treated as a misrecognition and filtered out. Conversely, once the drone enters close range and a horizontal bar is detected, the minimum-area rotated bounding box enclosing the largest connected component of the horizontal-bar mask is generated, and the center coordinates and inclination of this box are used as control inputs. The physical distance to the target point is taken as the 25th-percentile value (q = 0.25 in Algorithm 1) of the per-pixel depth values within the hurdle or horizontal-bar region. Figure 3 shows an example of the output data finally provided by the inference module. For each camera image, the inference module provides the center coordinates of the hurdle or horizontal bar in the image, the angle of the rotated bounding box enclosing the horizontal bar, the lengths of its short and long sides, and the distance value.
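The history-based rejection of undersized hurdle masks can be sketched as a small stateful filter. This is a simplified illustration using the constants listed in Algorithm 1 (history length N = 20, noise factor α = 0.1). The class and method names are hypothetical, and the sketch ignores the branch logic that restricts the rule to frames in which only the hurdle mask is present.

```python
from collections import deque


class AreaHistoryFilter:
    """Rejects frames whose mask area collapses relative to the recent mean."""

    def __init__(self, history_len: int = 20, alpha: float = 0.1):
        self.history = deque(maxlen=history_len)  # keeps only the last N areas
        self.alpha = alpha

    def accept(self, area_px: int) -> bool:
        # As in Algorithm 1, the current area is appended before the mean is taken,
        # so rejected frames still contribute to the history.
        self.history.append(area_px)
        avg = sum(self.history) / len(self.history)
        return not (avg > 0 and area_px < self.alpha * avg)


if __name__ == "__main__":
    area_filter = AreaHistoryFilter()
    for area in [1000] * 10 + [50, 900]:
        print(area, area_filter.accept(area))
```

In this example the ten frames with an area of 1000 pixels are accepted, the isolated 50-pixel mask falls below 10% of the running mean and is rejected, and the following 900-pixel frame is accepted again, so a moderate area change does not trigger the filter.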

The main heuristic choices in Algorithm 1 were selected to stabilize control-relevant quantities rather than to optimize pixel-level segmentation quality. The history-based area filter rejects frames whose detected hurdle area becomes much smaller than the mean over the recent history. This rule is used only in the red-only branch, which typically corresponds to the far-range regime, where the hurdle occupies a small image region and isolated false positives often appear as tiny, short-lived masks. The factor $α = 0.1$ was chosen as a conservative threshold that removes these transient detections without suppressing normal area fluctuations caused by moderate viewpoint changes. The largest-component rule in the near-range branch is used because the controller needs only one bar toward which to guide the drone, and considering all bar candidates could misdirect it when some are false positives. For distance estimation, we use the 25th percentile of masked depth values instead of the minimum or the mean. The minimum was too sensitive to sparse outliers, while the mean was biased by background pixels near mask boundaries. The lower percentile provided a more stable estimate of the front-side structure depth, which is the quantity most relevant for approach control.
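The difference between the minimum, the mean, and the 25th percentile can be seen on a small synthetic example. The sketch below is illustrative only: it assumes that a depth value of 0 marks an invalid pixel and that oversized masks are randomly subsampled down to the sample limit of 20,000 listed in Algorithm 1. Neither detail is specified in the text, and the function name and depth values are hypothetical.

```python
import numpy as np


def masked_depth_percentile(depth, mask, q=0.25, max_samples=20000, seed=0):
    """Return the q-quantile of valid depth values inside the mask, or None."""
    values = depth[(mask > 0) & (depth > 0)]     # assumption: 0 means invalid depth
    if values.size == 0:
        return None
    if values.size > max_samples:                # assumption: random subsampling
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=max_samples, replace=False)
    return float(np.percentile(values, q * 100.0))


if __name__ == "__main__":
    # Synthetic 10 x 10 masked region: 5 near outliers, 80 pixels on the
    # structure, and 15 background pixels leaking in at the mask boundary.
    depth = np.array([300] * 5 + [2000] * 80 + [6000] * 15, dtype=np.uint16)
    depth = depth.reshape(10, 10)
    mask = np.ones((10, 10), dtype=np.uint8)
    print("min:", depth.min())
    print("mean:", depth.mean())
    print("25th percentile:", masked_depth_percentile(depth, mask))
```

Here the minimum (300) follows the sparse outliers and the mean (2515) is pulled toward the background, while the 25th percentile returns 2000, the depth of the structure itself.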

Algorithm 1 Segmentation post-processing and control-state extraction
Require: Segmentation image $S \in \mathbb{R}^{H \times W \times 3}$ (BGR), depth map $D \in \mathbb{N}^{H \times W}$ (optional, 16UC1)
Ensure: $(c_{x}, c_{y}, θ, s, l, d)$
Constants: history length $N = 20$, noise factor $α = 0.1$, depth percentile $q = 0.25$, max samples $M_{max} = 20{,}000$
1: /* persistent state */ $θ_{prev}$ (initialized to 0), history queue $H$ (keep last $N$)
2: Split $S$ into channels $(B, G, R)$
3: $M_{G} \leftarrow G$, $M_{R} \leftarrow (G \lor R)$    ▹ ∨: per-pixel OR on 8-bit masks
4: $\mathit{gCount} \leftarrow$ CountNonZero$(M_{G})$, $\mathit{rCount} \leftarrow$ CountNonZero$(M_{R})$
5: Append $\mathit{rCount}$ to $H$; if $|H| > N$, pop oldest
6: if $\mathit{gCount} + \mathit{rCount} = 0$ then return $(-1, -1, θ_{prev}, 0, -1, 0)$
7: if $\mathit{gCount} = 0 \land \mathit{rCount} > 0$ then    ▹ red-only branch (far range)
8:   $\mathit{avg} \leftarrow$ Mean$(H)$
9:   if $\mathit{avg} > 0 \land \mathit{rCount} < α \cdot \mathit{avg}$ then return $(-1, -1, θ_{prev}, 0, -1, 0)$    ▹ noise rejection
10:   $(m_{00}, m_{10}, m_{01}) \leftarrow$ Moments$(M_{R})$
11:   if $m_{00} \leq 0$ then return $(-1, -1, θ_{prev}, 0, -1, 0)$
12:   $c_{x} \leftarrow m_{10} / m_{00}$, $c_{y} \leftarrow m_{01} / m_{00}$, $θ \leftarrow θ_{prev}$, $s \leftarrow 0$, $l \leftarrow 0$
13:   $M_{target} \leftarrow M_{R}$
14: else    ▹ green-present branch (near range)
15:   $(\mathit{labels}, \mathit{stats}, \mathit{nLabels}) \leftarrow$ ConnectedComponentsWithStats$(M_{G}, \text{8-conn})$
16:   if $\mathit{nLabels} \leq 1$ then return $(-1, -1, θ_{prev}, 0, -1, 0)$
17:   $L \leftarrow \mathbb{I}(\mathit{labels} = \arg\max_{k \in \{1, \dots, \mathit{nLabels} - 1\}} Area(k))$    ▹ largest component mask, 8-bit
18:   $P \leftarrow$ FindNonZero$(L)$    ▹ list of pixel coordinates
19:   if $P = \emptyset$ then return $(-1, -1, θ_{prev}, 0, -1, 0)$
20:   $R_{box} \leftarrow$ MinAreaRect$(P)$
21:   $(c_{x}, c_{y}) \leftarrow center(R_{box})$, $(w, h) \leftarrow size(R_{box})$, $s \leftarrow \min(w, h)$, $l \leftarrow \max(w, h)$
22:   $p_{i} \leftarrow points(R_{box})$ for $i \in \{0, 1, 2, 3\}$    ▹ 4 vertices
23:   /* slope computation exactly as implemented */
24:   if $s = h$ then    ▹ short side equals height
25:     $θ \leftarrow atan2((p_{0}.y + p_{1}.y) - (p_{2}.y + p_{3}.y), (p_{0}.x + p_{1}.x) - (p_{2}.x + p_{3}.x))$
26:   else
27:     $θ \leftarrow atan2((p_{0}.y + p_{1}.y) - (p_{2}.y + p_{3}.y), (p_{0}.x + p_{1}.x) - (p_{2}.x + p_{3}.x)) + π / 2$
28:   end if
29:   $θ \leftarrow wrap(θ, -π / 2, π / 2)$, $θ_{prev} \leftarrow θ$
30:   $M_{target} \leftarrow L$
31: end if
32: /* distance rule: masked depth percentile */
33: if $D$ not available then
34:   $d \leftarrow 0$
35: else
36:   $V \leftarrow \{D(x, y) \mid M_{target}(x, y) \neq 0 \land D(x, y) > 0\}$
37:   if $V = \emptyset$ then
38:     $d \leftarrow 0$
39:   else
40:     if $|V| > M_{max}$ then
41:       $V \leftarrow$ uniform subsample of $V$ to size $M_{max}$
42:     end if
43:     $d \leftarrow NthElement(V, \lfloor q \cdot |V| \rfloor)$    ▹ value at rank k (no full sort)
44:   end if
45: end if
46: return $(c_{x}, c_{y}, θ, s, l, d)$

4. Autonomous Perching Algorithm

The proposed autonomous perching system integrates vision-based state estimation and control on an F450-frame quadcopter. For outdoor flights, GPS is excluded, and position is estimated only through stereo-camera-based MSCKF-VIO, assuming a GNSS-denied environment. Figure 4 shows the complete hardware architecture and the interconnection of all components. Figure 5 shows the resulting F450-based platform used in the experiments.

An NVIDIA Jetson Orin NX serves as the mission computer for perception and high-level control, while a Pixhawk 6C handles low-level attitude control. Two RealSense stereo cameras provide visual information. One is fixed to the front of the vehicle for VIO. The other is mounted on a single-axis gimbal driven by an XL330 servo for hurdle tracking. A 1D LiDAR (Lidar Lite V3) and a ToF distance sensor (VL53L0X) are used for altitude measurement and short-range distance checking during perching, respectively. This sensor configuration was chosen to maximize portability and ease of integration. Stereo cameras, a gimbal, and a 1D LiDAR are common UAV components, so the platform is easy to reproduce. By adding only a ToF sensor to this standard setup, we keep hardware dependency low while preserving the sensing functions required for markerless perching. The ToF sensor is reserved for the final stage of approach. Its effective range is limited to about 30 cm because the hurdle bar is thin, rounded, and weakly reflective.

One important design choice is the placement of the 1D LiDAR at the rear of the drone rather than directly beneath its center. This choice reduces the risk of false altitude measurements during hurdle approach. If the LiDAR were mounted in a conventional downward-facing position under the vehicle center, its beam could intersect the hurdle during descent. The controller would then interpret the shortened range as rapid altitude loss. That false reading would trigger an unnecessary ascent command, as illustrated in Figure 6, and could produce unstable oscillatory motion. By mounting the LiDAR at the rear and adopting a forward-then-backward approach trajectory, we keep the LiDAR beam clear of the hurdle for most of the maneuver and maintain a more reliable altitude reference.

In this study, considering limited hardware resources and uncertain outdoor environments, the design requirements for the autonomous landing algorithm are specified as follows:

Robustness to High Position Uncertainty: This system relies entirely on VIO with possible accumulated errors, without absolute position sensors such as GPS or motion capture systems. Therefore, the control algorithm must have robust tracking performance that can tolerate the inevitably occurring drift and noise in position estimation [35].
Perching Phase Independent of Depth Information: During the descent phase, when the drone approaches the hurdle, depth information cannot be obtained because the target object enters within the stereo camera’s minimum measurement distance. Therefore, during the descent phase, precise control needs to be possible using only 2D image features and ToF sensor data without relying on depth information.
Lightweight Design for Edge Computing Environment: The Jetson Orin NX mounted on the drone needs to simultaneously perform VIO, object recognition, gimbal, and velocity control. Therefore, the entire algorithm requires lightweighting to operate stably within limited computational resources.
Response to Irregular Control Periods: Due to the high computational demands of VIO and the hurdle recognition model, the control loop period ( $Δ t$ ) may fluctuate severely. The algorithm needs to be designed to land autonomously and stably despite irregularities in control periods.

To satisfy the above requirements, this study proposes the state-machine control structure shown in Figure 7. The overall process consists of five stages. Prepare Reaching performs altitude adjustment and initialization. Reaching tracks and approaches the hurdle from long range. Perching descends precisely toward the horizontal bar from above. Landing stabilizes posture after the drone makes contact with and becomes supported by the bar. Searching scans the surroundings when recognition fails.

The state machine is adopted because it separates the theoretically grounded Reaching phases from the empirically tuned Perching phases. This separation simplifies both development and integration. It also allows each phase to be optimized independently while keeping the overall logic explicit. In particular, the Prepare Reaching stage is separated from Reaching so that altitude can be regulated by LiDAR before forward motion begins. If altitude were adjusted during forward approach, terrain-dependent LiDAR variations could cause unnecessary vertical corrections and make the vehicle altitude oscillate while it is moving toward the hurdle. Separating the stages, therefore, reduces the risk that VIO drift or terrain-induced vertical oscillation will destabilize the approach.

Across all stages, the gimbal command is updated continuously to track the hurdle. When the hurdle is not detected, the gimbal switches to sweeping motion. In Prepare Reaching, the drone mainly regulates altitude using the LiDAR measurement while keeping horizontal motion close to zero. In Reaching, it moves toward the hurdle with commanded x/y velocities derived from the hurdle position calculated from the camera measurements, while keeping vertical velocity at zero. In Perching, the y and yaw commands are chosen to center and align the hurdle in the image. The target gimbal angle on the desired trajectory is computed from the LiDAR-based altitude estimate. The x velocity is then determined from the difference between the target and current gimbal angles. The z velocity is commanded downward only when the drone is sufficiently close to the desired trajectory and satisfies the y/yaw alignment conditions; otherwise, the controller commands ascent or hover for recovery. In Searching, the drone suppresses aggressive translational motion and relies on gimbal sweeping to reacquire the target. The detailed control law for each stage is described below.

The drone body coordinate system is denoted as

{B}

, the gimbal coordinate system as

{G}

, the camera coordinate system as

{C}

, the virtual camera coordinate system obtained by translating the inertial coordinate system to be centered at the drone’s center as

{V}

, and the inertial coordinate system as

{A}

. Because the PX4 flight controller receives external command inputs through MAVLink, we use the MAVLink command convention in this study [36]. Due to the structure of the MAVLink interface, all drone velocity and angular velocity control commands are issued through the virtual camera coordinate system. For computational efficiency during close-range perching operations, where attitude variations are typically small, and the system operates in hover-like conditions, we employ a simplified coordinate transformation that neglects higher-order terms in the rotation matrix expansion. This approximation is valid when the drone maintains relatively stable attitude angles and operates at low translational and rotational velocities, which are characteristic conditions during the precision perching phase. The relationship between the drone body and virtual camera coordinate systems can be expressed as follows:

v_{V} = [\begin{matrix} cos ψ & - sin ψ & 0 \\ sin ψ & cos ψ & 0 \\ 0 & 0 & 1 \end{matrix}] v_{B}, {\dot{ψ}}_{V} = {\dot{ψ}}_{B},

where

ψ

is the drone’s current yaw direction.
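As a small illustration of the yaw-only transformation above, the following sketch rotates a body-frame velocity into the virtual camera frame. The function name `body_to_virtual` is hypothetical, and roll and pitch are neglected, as in the simplified relation.

```python
import math


def body_to_virtual(v_body, yaw):
    """Rotate a body-frame velocity (x, y, z) about the z axis by `yaw` [rad].

    Implements v_V = R_z(yaw) v_B; the yaw rate passes through unchanged,
    so it is not handled here.
    """
    vx, vy, vz = v_body
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * vx - s * vy, s * vx + c * vy, vz)


if __name__ == "__main__":
    # A forward body velocity with a 90-degree yaw maps onto the +y axis of {V}.
    print(body_to_virtual((1.0, 0.0, 0.0), math.pi / 2))
```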

In all flight stages, the gimbal is controlled to continuously track the horizontal bar of the hurdle, and when the hurdle is not detected, it performs a sweeping motion in a designated pattern to maintain the field of view. Figure 8 shows the geometric relationship of the hurdle projected on the camera plane. When the center coordinates of the hurdle on the image plane are

(u, v)

, the vertical error

e_{v}

from the image center

(c_{x}, c_{y})

and the corresponding pitch angle error

ϕ_{y}

of the hurdle are as follows:

\begin{matrix} e_{v} & = v - c_{y}, \\ ϕ_{y} & = arctan (\frac{e_{v}}{f_{y}}), \end{matrix}

where

f_{y}

is the y-axis focal length of the camera. The pitch angle

θ_{hurdle}

of the hurdle with respect to the body coordinate system

{B}

is estimated as follows using the current gimbal pitch angle

θ_{cam}

.

θ_{hurdle} = θ_{cam} + ϕ_{y} .
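The pixel-to-angle conversion above can be sketched as follows. The function name `hurdle_pitch` and the intrinsic values in the usage example are illustrative assumptions, not values reported in the paper.

```python
import math


def hurdle_pitch(v, c_y, f_y, theta_cam):
    """Return (theta_hurdle, phi_y) from the vertical pixel coordinate `v`.

    phi_y is the pitch error of the hurdle relative to the optical axis,
    and theta_hurdle adds the current gimbal pitch angle.
    """
    e_v = v - c_y
    phi_y = math.atan(e_v / f_y)
    return theta_cam + phi_y, phi_y


if __name__ == "__main__":
    # Hypothetical intrinsics: principal point row 240, focal length 600 px.
    print(hurdle_pitch(v=300, c_y=240, f_y=600.0, theta_cam=0.3))
```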

Gimbal control uses PID control with anti-windup in the discrete-time domain. When defining the control error as

ϕ_{y}^{*}

, the calculation of each term and the final control output

δ θ_{cam}

are as follows:

\begin{matrix} ϕ_{y}^{*} [k] & = sat (ϕ_{y} [k], - k_{ϕ_{y}, sat}, k_{ϕ_{y}, sat}), \\ I_{ϕ} [k] & = sat (I_{ϕ} [k - 1] + ϕ_{y}^{*} [k] \cdot Δ t, - k_{i, cam, sat}, k_{i, cam, sat}), \\ D_{ϕ} [k] & = sat (\frac{ϕ_{y}^{*} [k] - ϕ_{y}^{*} [k - 1]}{Δ t}, - k_{d, cam, sat}, k_{d, cam, sat}), \\ δ θ_{cam} [k] & = k_{p, cam} ϕ_{y}^{*} [k] + k_{i, cam} I_{ϕ} [k] + k_{d, cam} D_{ϕ} [k] . \end{matrix}
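A minimal sketch of this saturated discrete-time PID update is given below. The class name `GimbalPID` and all gain and limit values in the example are hypothetical; the measured loop interval `dt` is passed in at every step, which is one way to accommodate an irregular control period.

```python
from dataclasses import dataclass


def sat(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


@dataclass
class GimbalPID:
    kp: float
    ki: float
    kd: float
    err_sat: float   # limit on the error term
    int_sat: float   # anti-windup limit on the integral term
    der_sat: float   # limit on the derivative term
    integral: float = 0.0
    prev_err: float = 0.0

    def step(self, phi_y, dt):
        """Return the gimbal angle increment for pitch error phi_y over dt."""
        err = sat(phi_y, -self.err_sat, self.err_sat)
        self.integral = sat(self.integral + err * dt, -self.int_sat, self.int_sat)
        deriv = sat((err - self.prev_err) / dt, -self.der_sat, self.der_sat)
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv


if __name__ == "__main__":
    pid = GimbalPID(kp=1.0, ki=0.5, kd=0.0, err_sat=0.2, int_sat=0.05, der_sat=1.0)
    for dt in (0.10, 0.14, 0.08):   # irregular loop periods
        print(pid.step(1.0, dt))
```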

The searching stage is a state where the structure has not been detected. The gimbal performs a sweeping motion at a constant speed to search for structures, and the drone’s target velocity is set to 0 in all directions. Gimbal angle updates use the following conditional incremental method.

θ_{cam} [k] = \{\begin{matrix} θ_{cam} [k - 1] + θ_{search, sweep} & if θ_{cam} [k - 1] < θ_{search, sweep, \max} \\ θ_{search, sweep, \min} & otherwise \end{matrix}

v_{V, x, search} = 0, v_{V, y, search} = 0, v_{V, z, search} = 0, {\dot{ψ}}_{V, search} = 0 .
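The conditional sweep update can be written as a one-step function. This is a sketch with a hypothetical name (`sweep_step`) and illustrative angle limits; the vehicle velocity command stays at zero while it runs.

```python
def sweep_step(theta_prev, step, theta_min, theta_max):
    """Advance the gimbal angle by `step`, wrapping to theta_min at the top."""
    if theta_prev < theta_max:
        return theta_prev + step
    return theta_min


if __name__ == "__main__":
    theta = -0.2
    for _ in range(8):   # produces a sawtooth pattern between the limits
        theta = sweep_step(theta, step=0.1, theta_min=-0.2, theta_max=0.3)
        print(round(theta, 3))
```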

In the prepare reaching stage, the drone adjusts its altitude using P control on the altitude error. If the drone moves sufficiently slowly, the lidar measurement can be used directly as the altitude without transformation. The target velocity

v_{V, prepare_reach}^{*}

and rotational velocity

{\dot{ψ}}_{V, prepare_reach}^{*}

in the virtual camera coordinate system are calculated as follows:

\begin{matrix} v_{V, x, prepare_reach} & = 0, v_{V, y, prepare_reach} = 0, {\dot{ψ}}_{V, prepare_reach} = 0, \\ v_{V, z, prepare_reach} & = sat (K_{p, prepare_reach} e_{height}, - k_{prepare_reach}, k_{prepare_reach}), \end{matrix}

where

e_{height} = h_{prepare_reach} - z_{lidar}

is the altitude error calculated as the difference between the target altitude and the lidar-measured altitude. When the drone altitude

z_{lidar}

measured from the lidar satisfies

z_{lidar} > h_{reach}

, it transitions to the reaching stage, and when it satisfies

z_{lidar} > h_{search}

while the hurdle is not detected, it transitions to the searching stage.

In the reaching stage, when the image coordinates

(u, v)

of the hurdle horizontal bar, camera parameters

(f_{x}, f_{y}, c_{x}, c_{y})

, and distance

z_{c}

to the hurdle center obtained from stereo images are given, the position of the hurdle is as follows:

p_{c} = [\begin{matrix} x_{c} \\ y_{c} \\ z_{c} \end{matrix}] = z_{c} [\begin{matrix} (u - c_{x}) / f_{x} \\ (v - c_{y}) / f_{y} \\ 1 \end{matrix}],

where

p_{c}

is the 3D position of the hurdle in the camera coordinate system. Applying rotation due to camera installation direction and rotation by the single-axis gimbal, we obtain the position of the hurdle in the body coordinate system

p_{B}

as follows:

p_{B} = [\begin{matrix} x_{B} \\ y_{B} \\ z_{B} \end{matrix}] = R_{B C} (θ_{cam}) p_{c},

with

\begin{matrix} R_{B C} (θ_{cam}) & = R_{B G} R_{G C} (θ_{cam}) \\ & = [\begin{matrix} 0 & 0 & 1 \\ - 1 & 0 & 0 \\ 0 & - 1 & 0 \end{matrix}] [\begin{matrix} 1 & 0 & 0 \\ 0 & cos θ_{cam} & sin θ_{cam} \\ 0 & - sin θ_{cam} & cos θ_{cam} \end{matrix}], \end{matrix}

where

p_{B}

is the 3D position of the hurdle in the body coordinate system. In the reaching stage, altitude velocity is maintained at 0, and the drone approaches the target at a constant speed. To achieve this controlled approach, the target velocities

v_{B, reach}^{*}

and

{\dot{ψ}}_{B, reach}^{*}

are calculated as follows:

\begin{matrix} v_{B, reach}^{*} & = [\begin{matrix} v_{B, x, reach}^{*} \\ v_{B, y, reach}^{*} \\ v_{B, z, reach}^{*} \end{matrix}] = [\begin{matrix} v_{\max, reach} \frac{p_{B, x}}{ρ} \\ v_{\max, reach} \frac{p_{B, y}}{ρ} \\ 0 \end{matrix}], \\ {\dot{ψ}}_{B, reach}^{*} & = atan2 (p_{B, y}, p_{B, x}), \end{matrix}

where

ρ = \sqrt{p_{B, x}^{2} + p_{B, y}^{2}}

is the horizontal distance from the drone to the hurdle,

p_{B, x}

and

p_{B, y}

are the horizontal coordinates of the hurdle in the body coordinate system, and

v_{\max, reach}

is the maximum approach speed. P control is used in the reaching stage, and the control inputs

v_{B, reach}

and

{\dot{ψ}}_{B, reach}

are calculated as follows:

\begin{matrix} v_{B, reach} & = [\begin{matrix} k_{x, reach} v_{B, x, reach}^{*} \\ k_{y, reach} v_{B, y, reach}^{*} \\ 0 \end{matrix}], \\ {\dot{ψ}}_{B, reach} & = sat (k_{ψ, reach} {\dot{ψ}}_{B, reach}^{*}, - ω_{\max, reach}, ω_{\max, reach}), \end{matrix}

where

k_{x, reach}

,

k_{y, reach}

, and

k_{ψ, reach}

are the P control gains for each direction, and

ω_{\max, reach}

is the maximum rotational speed limit.
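The reaching-stage computations (pinhole back-projection, rotation into the body frame, and the saturated P command) can be sketched together as follows. Function names, gains, and intrinsics are hypothetical, and the guard for a zero horizontal distance is an added assumption that the text does not discuss.

```python
import math


def sat(x, lo, hi):
    return max(lo, min(hi, x))


def camera_to_body(u, v, z_c, fx, fy, cx, cy, theta_cam):
    """Back-project pixel (u, v) at depth z_c and rotate it into the body frame."""
    p_c = (z_c * (u - cx) / fx, z_c * (v - cy) / fy, z_c)
    # R_GC(theta_cam): rotation about the camera x axis.
    c, s = math.cos(theta_cam), math.sin(theta_cam)
    g = (p_c[0], c * p_c[1] + s * p_c[2], -s * p_c[1] + c * p_c[2])
    # R_BG with rows [0 0 1], [-1 0 0], [0 -1 0].
    return (g[2], -g[0], -g[1])


def reach_command(p_b, v_max, k_x, k_y, k_psi, omega_max):
    """Return ((vx, vy, vz), yaw_rate) for the reaching stage."""
    rho = math.hypot(p_b[0], p_b[1])
    if rho == 0.0:   # assumption: no horizontal motion if already above target
        return (0.0, 0.0, 0.0), 0.0
    vx = k_x * v_max * p_b[0] / rho
    vy = k_y * v_max * p_b[1] / rho
    yaw_rate = sat(k_psi * math.atan2(p_b[1], p_b[0]), -omega_max, omega_max)
    return (vx, vy, 0.0), yaw_rate


if __name__ == "__main__":
    p_b = camera_to_body(400, 260, 5.0, 600.0, 600.0, 320.0, 240.0, theta_cam=0.2)
    print(p_b, reach_command(p_b, v_max=0.5, k_x=1.0, k_y=1.0, k_psi=1.0, omega_max=0.3))
```

Normalizing by the horizontal distance keeps the commanded speed constant regardless of range, so only the direction of approach depends on the measured position.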

In the reaching stage, when the distance to the horizontal bar of the hurdle falls below a threshold and the gimbal angle tracking the horizontal bar exceeds a threshold, the drone transitions to the perching stage:

θ_{cam} > θ_{start} \land \sqrt{p_{B, x}^{2} + p_{B, y}^{2} + p_{B, z}^{2}} < d_{start},

where

θ_{start}

is the threshold angle indicating that the gimbal is sufficiently pointing downward, and

d_{start}

is the threshold distance for target proximity determination. Also, if the hurdle recognition result has not been updated for

k_{update}

cycles, it transitions to the searching stage.

Figure 9 shows the flowchart of the perching stage. In the perching stage, as the drone approaches the hurdle, distance calculation through stereo matching becomes inaccurate, and segmentation performance degrades due to occlusion. Therefore, in the perching stage, control is performed without using depth information.

First, the pitch angle of the hurdle horizontal bar

θ_{hurdle}

observed by the gimbal camera and the gimbal target angle change

δ θ_{cam}

for the next time step are calculated. If the hurdle recognition result has been updated within the last

k_{update}

control cycles, the gimbal is moved by this target angle change. Otherwise, the distance to the bar measured by the distance sensor is checked. If the distance is below

d_{dist, sweep}

, the drone is directly above the bar and about to make contact, so it can land simply by descending, and the camera does not search for the bar again. Otherwise, the bar must be found again, so the gimbal sweeps while the following velocity command slowly raises the drone’s altitude:

v_{B, perch} = [\begin{matrix} 0 \\ 0 \\ k_{sweep, perch} \end{matrix}] .

If the drone’s altitude then reaches

z_{lidar, sweep}

or higher, the drone transitions to searching mode.

If the hurdle is lost during descent in the perching stage, the gimbal also sweeps as

θ_{cam} [k] = \{\begin{matrix} θ_{cam} [k - 1] + θ_{perch, sweep} & if θ_{cam} [k - 1] < θ_{perch, sweep, \max}, \\ θ_{perch, sweep, \min} & otherwise . \end{matrix}

Unlike the searching stage, which sweeps while looking forward, the perching stage sweeps while looking at the ground, because the hurdle is likely to be below the drone.

If the drone has not transitioned to searching mode, the target velocity in perching mode is determined. The target velocity in the y/yaw direction is determined from the x-coordinate and inclination of the hurdle horizontal bar in the image. To generate a control input when the actual distance from the aircraft center to the horizontal bar is unknown, a projection point

p_{c}^{*}

in the camera coordinate system, assuming a virtual distance

d = 1

, is defined. The target angular velocity

{\dot{ψ}}_{B, perch}^{*}

in the perching stage is derived under the assumption that the aircraft is nearly above the hurdle horizontal bar and the gimbal camera points almost straight down at the ground (

θ_{cam} \approx π / 2

). Under this assumption, the projection

p_{c}^{*}

and the corresponding body frame coordinate

p_{B, y}^{*}

are as follows:

p_{c}^{*} = [\begin{matrix} (u - c_{x}) / f_{x} \\ (v - c_{y}) / f_{y} \\ 1 \end{matrix}],

p_{B, y}^{*} = [\begin{matrix} 0 & 1 & 0 \end{matrix}] R_{B G} R_{G C} (θ_{cam}) p_{c}^{*} = - \frac{u - c_{x}}{f_{x}},

where

R_{B G}

and

R_{G C} (θ_{cam})

are the rotation matrices defined in the reaching stage. The target velocity in the perching stage

v_{B, perch}^{*}

and the target angular velocity

{\dot{ψ}}_{B, perch}^{*}

are set as follows:

v_{B, perch}^{*} = [\begin{matrix} v_{B, x, perch}^{*} \\ p_{B, y}^{*} \\ v_{B, z, perch}^{*} \end{matrix}],

{\dot{ψ}}_{B, perch}^{*} = γ,

where

p_{B, y}^{*}

is the position error in the y direction at the virtual unit distance, and

γ

is the inclination of the hurdle horizontal bar in the image. The target velocity vector

v_{B, perch}^{*}

is input to the PID controller.
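The lateral term can be checked numerically with a short sketch. Because the gimbal rotates about the camera x axis, the body-frame y component of the unit-distance projection does not depend on the gimbal angle. The function name `lateral_error` and the intrinsics used here are hypothetical.

```python
import math


def lateral_error(u, v, fx, fy, cx, cy, theta_cam):
    """Body-frame y component of the projection point at virtual distance 1."""
    p = ((u - cx) / fx, (v - cy) / fy, 1.0)
    c, s = math.cos(theta_cam), math.sin(theta_cam)
    # R_GC(theta_cam) followed by R_BG, as defined in the reaching stage.
    g = (p[0], c * p[1] + s * p[2], -s * p[1] + c * p[2])
    body = (g[2], -g[0], -g[1])
    return body[1]


if __name__ == "__main__":
    for theta in (0.0, math.pi / 4, math.pi / 2):
        # Same value for every gimbal angle: -(u - cx) / fx.
        print(lateral_error(400, 100, 600.0, 600.0, 320.0, 240.0, theta))
```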

The target velocity in the

x / z

directions is determined by several factors: the observed hurdle angle

θ_{hurdle}

, the target bar angle

θ_{hurdle}^{*} (z_{lidar})

at the current altitude, the LiDAR altitude

z_{lidar}

, the LiDAR+IMU corrected altitude

z_{filtered}

, the y/yaw alignment state, and the status of the perch_failed flag. The perch_failed flag is asserted when a false landing is attempted. In that case, the drone slowly reverses while climbing to

z_{lidar, perch_failed}

. If the perch_failed flag is asserted, the target velocity in the drone body frame is set to

v_{B, perch}^{*} = [\begin{matrix} - k_{x, perch_failed, perch} \\ v_{B, y, perch}^{*} \\ k_{z, perch_failed, perch} \end{matrix}] .

The perch_failed flag is released when

z_{lidar} > z_{lidar, perch_failed}

.

If the drone altitude

z_{lidar}

falls below

h + d_{lidar, offset}

, the drone is in one of three states: it has landed on the bar, it has failed to land, or LiDAR occlusion has occurred because the beam hit the hurdle bar. If the distance sensor measurement satisfies

d_{dist} < d_{dist, land}

, the drone is judged to have landed on the bar, and the target velocity is

v_{B, perch}^{*} = [\begin{matrix} - k_{x, land, perch} \\ v_{B, y, perch}^{*} \\ k_{z, land, perch} \end{matrix}]

and the landing counter is increased. Otherwise, if

| z_{filtered} - z_{lidar} | > h_{occlusion}

, LiDAR occlusion is assumed to have occurred. Because the LiDAR is placed at the rear bottom of the drone, the drone must reverse to escape this state. The target velocity then becomes

v_{B, perch}^{*} = [\begin{matrix} - k_{occlusion, perch} \\ v_{B, y, perch}^{*} \\ 0 \end{matrix}]

. If neither case applies, a false landing was attempted. In this case, the perch_failed flag is asserted, and the target velocity is set to

v_{B, perch}^{*} = [\begin{matrix} 0 \\ v_{B, y, perch}^{*} \\ k_{z, perch_failed_flag, perch} \end{matrix}]

to attempt ascent.
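The three-way decision at low LiDAR altitude can be summarized by a small classifier. This is a sketch only: the function name, the returned labels, and the threshold values in the example are hypothetical, while the order of the checks follows the description above.

```python
def low_altitude_case(z_lidar, z_filtered, d_dist, h,
                      d_lidar_offset, d_dist_land, h_occlusion):
    """Classify the situation once the LiDAR altitude drops near the bar height.

    Returns one of "normal", "landed", "lidar_occluded", "perch_failed".
    """
    if z_lidar >= h + d_lidar_offset:
        return "normal"            # altitude still above the decision band
    if d_dist < d_dist_land:
        return "landed"            # ToF sensor sees the bar right below
    if abs(z_filtered - z_lidar) > h_occlusion:
        return "lidar_occluded"    # LiDAR disagrees with LiDAR+IMU altitude
    return "perch_failed"          # low, but neither on the bar nor occluded


if __name__ == "__main__":
    params = dict(h=1.0, d_lidar_offset=0.1, d_dist_land=0.05, h_occlusion=0.3)
    print(low_altitude_case(0.2, 1.2, 0.5, **params))
```

Checking the ToF distance before the occlusion test means a successful contact is never mistaken for occlusion, even if the LiDAR beam happens to hit the bar at that moment.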

Figure 10 shows the target trajectory used in the perching stage. In the figure, the drone approaches from the right, and the horizontal bar of the hurdle is located at

(0, 0, h)

, where h is the hurdle height and w is the distance between the drone center and the camera. The perching target trajectory is defined as

(f (z), 0, z)

. This trajectory is designed to keep the rear-mounted LiDAR beam from being occluded by the hurdle during descent and to reduce rear-looking views of the gimbal camera. This is important because, during the final approach, rear-looking views increase the likelihood that the drone body enters the image, and we empirically observed that the recognition model can falsely detect parts of the airframe, particularly the landing gear, as hurdle-like structures. Reducing such views, therefore, improves visual recognition reliability in the final approach.

For these reasons, we adopt a trajectory that descends in front of the hurdle, moves forward to directly above the horizontal bar, and then reverses:

f (z) = - \frac{0.45}{e^{- 7 (z - (h + 0.22))} + e^{7 (z - (h + 0.22))}} + 0.092 .

The functional form was chosen to realize a smooth forward-then-backward motion with a single transition around the bar height; in particular, the shape of

1 / (e^{- x} + e^{x})

was adopted because it naturally yields a continuous profile that first advances and then retreats as altitude changes. Within this functional form, the numerical coefficients were then empirically tuned through preliminary flight tests so as to reduce the total duration of the landing process while preserving successful perching behavior. At each z,

tan (π - θ_{hurdle}^{*} (z)) = \frac{z - h}{f (z) + w},

so

θ_{hurdle}^{*} (z) = π - {tan}^{- 1} (\frac{z - h}{f (z) + w}) .

This is used as the target gimbal angle according to the drone’s altitude.
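The trajectory profile and the resulting target angle can be evaluated with the sketch below. The coefficients are those of the trajectory equation above; the function names and the values of h and w in the example are hypothetical. The arctangent form is undefined where f(z) + w = 0, which this sketch does not handle.

```python
import math


def f(z, h):
    """Horizontal offset of the target trajectory at altitude z (bar height h)."""
    x = 7.0 * (z - (h + 0.22))
    return -0.45 / (math.exp(-x) + math.exp(x)) + 0.092


def target_hurdle_angle(z, h, w):
    """Target bar angle at altitude z; w is the camera offset from the center."""
    return math.pi - math.atan((z - h) / (f(z, h) + w))


if __name__ == "__main__":
    h, w = 1.0, 0.1   # illustrative bar height and camera offset in meters
    for z in (2.0, 1.5, 1.22, 1.1, 1.0):
        print(z, round(f(z, h), 4), round(target_hurdle_angle(z, h, w), 4))
```

The offset is most negative (about -0.133) at z = h + 0.22, approaches 0.092 far from the bar, and is close to zero at z = h, which is consistent with the forward-then-backward profile described in the text.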

The target velocity in the x direction is computed from the altitude measured by the lidar and the current angle of the gimbal. Specifically, this velocity component

v_{B, x, perch}^{*}

is calculated using the difference between the target and current gimbal angles as follows:

v_{B, x, perch}^{*} = k_{x, perch} z_{lidar} (θ_{hurdle}^{*} (z_{lidar}) - θ_{hurdle}),

where

z_{lidar}

is the altitude measurement of the lidar.

The target velocity in the z direction depends on whether the bar’s y/yaw is aligned and whether the gimbal angle is within the target range. If the bar’s y/yaw is not aligned or the gimbal angle deviates significantly from the target range, the z-direction target velocity is set so that the drone altitude approaches

h_{hover}

. Otherwise, the drone stops or descends. The target velocity in the z direction

v_{B, z, perch}^{*}

is calculated as follows:

\begin{matrix} v_{B, z, perch}^{*} & = \\ \{\begin{matrix} k_{z 1, perch} (h_{hover} - z_{lidar}), & | γ | > γ_{max} \lor | p_{B, y}^{*} | > k_{y p b, perch} z_{lidar} \\ - k_{z 2, perch}, & d_{dist} < z_{lidar} + d_{dist, offset}, z_{lidar} < z_{lidar, \min} \\ k_{z 3, perch} (h_{hover} - z_{lidar}), & Δ θ > k_{θ 1} \lor Δ θ < - k_{θ 1} \\ - k_{z 4, perch}, & k_{θ 2} < Δ θ \leq k_{θ 1} \lor - k_{θ 1} \leq Δ θ < - k_{θ 2} \\ - k_{z 5, perch}, & otherwise \end{matrix}, \end{matrix}

where

Δ θ = θ_{cam} - θ_{hurdle}^{*} (z_{lidar})

.
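The x and z target velocities can be sketched as two small functions. This sketch reads the third case of the piecewise rule as |Δθ| > k_θ1, so that it complements the fourth case, and evaluates the cases in order from top to bottom. The parameter dictionary keys, function names, and numeric values are hypothetical.

```python
def x_velocity(k_x, z_lidar, theta_target, theta_hurdle):
    """Forward velocity proportional to altitude and to the angle mismatch."""
    return k_x * z_lidar * (theta_target - theta_hurdle)


def z_velocity(gamma, p_by, d_dist, z_lidar, delta_theta, p):
    """Piecewise vertical target velocity; cases are checked in order."""
    if abs(gamma) > p["gamma_max"] or abs(p_by) > p["k_ypb"] * z_lidar:
        return p["k_z1"] * (p["h_hover"] - z_lidar)     # not aligned: go to hover
    if d_dist < z_lidar + p["d_dist_offset"] and z_lidar < p["z_lidar_min"]:
        return -p["k_z2"]                               # bar sensed below: descend
    if abs(delta_theta) > p["k_theta1"]:
        return p["k_z3"] * (p["h_hover"] - z_lidar)     # far off trajectory
    if abs(delta_theta) > p["k_theta2"]:
        return -p["k_z4"]                               # slightly off: slow descent
    return -p["k_z5"]                                   # on trajectory: descend


if __name__ == "__main__":
    p = dict(gamma_max=0.2, k_ypb=0.1, k_z1=0.5, h_hover=2.0, d_dist_offset=0.1,
             z_lidar_min=1.2, k_z2=0.1, k_theta1=0.3, k_z3=0.5, k_theta2=0.1,
             k_z4=0.05, k_z5=0.2)
    print(z_velocity(0.0, 0.0, 9.9, 1.5, 0.2, p))
```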

As the drone descends, the landing counter perch_counter is incremented when the distance sensor measurement and the drone altitude satisfy the following conditions; otherwise, it is reset to 0:

\begin{matrix} perch_counter \leftarrow \\ \{\begin{matrix} perch_counter + 1, & \begin{matrix} (v_{B, z, perch}^{*} \in \{- k_{z 2, perch}, - k_{z 5, perch}\} \\ \lor d_{dist} < d_{dist, land}) \\ \land (z_{lidar, counter, \min} \leq z_{lidar} \leq z_{lidar, counter, \max}) \\ \land (d_{dist} < d_{dist, perchcount}) \end{matrix} \\ 0, & otherwise \end{matrix} . \end{matrix}
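The counter update can be expressed as a pure function of the current measurements. The function name and parameter keys below are hypothetical, and the exact floating-point membership test assumes that the commanded z velocity was produced from the same gain constants.

```python
def update_perch_counter(counter, v_z, d_dist, z_lidar, p):
    """Increment the counter while the landing conditions hold; else reset to 0."""
    descending = v_z in (-p["k_z2"], -p["k_z5"])
    in_band = p["z_min"] <= z_lidar <= p["z_max"]
    close = d_dist < p["d_perchcount"]
    if (descending or d_dist < p["d_land"]) and in_band and close:
        return counter + 1
    return 0


if __name__ == "__main__":
    p = dict(k_z2=0.1, k_z5=0.2, z_min=0.9, z_max=1.3, d_land=0.05, d_perchcount=0.4)
    c = 0
    for v_z, d, z in [(-0.2, 0.3, 1.1), (-0.2, 0.2, 1.0), (0.25, 0.3, 1.0)]:
        c = update_perch_counter(c, v_z, d, z, p)
        print(c)
```

Requiring several consecutive frames before the transition to the landing stage makes the decision less sensitive to a single spurious distance reading.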

Subsequently, transformation to the inertial coordinate system and PID control are performed. The final velocity command is computed as follows:

\begin{matrix} I_{v_{B}^{*}} [k] & = sat (I_{v_{B}^{*}} [k - 1] + v_{B, perch}^{*} [k] Δ t, - k_{i, sat, perch}, k_{i, sat, perch}), \\ D_{v_{B}^{*}} [k] & = \frac{v_{B, perch}^{*} [k] - v_{B, perch}^{*} [k - 1]}{Δ t} \\ v_{B, perch} [k] & = sat (K_{p, perch} v_{B, perch}^{*} [k] + K_{i, perch} I_{v_{B}^{*}} [k] + K_{d, perch} D_{v_{B}^{*}} [k], \\ - k_{sat, perch}, k_{sat, perch}) \\ I_{{\dot{ψ}}_{B}^{*}} [k] & = sat (I_{{\dot{ψ}}_{B}^{*}} [k - 1] + {\dot{ψ}}_{B, perch}^{*} [k] Δ t, - k_{ψ, i, sat, perch}, k_{ψ, i, sat, perch}), \\ D_{{\dot{ψ}}_{B}^{*}} [k] & = \frac{{\dot{ψ}}_{B, perch}^{*} [k] - {\dot{ψ}}_{B, perch}^{*} [k - 1]}{Δ t} \\ {\dot{ψ}}_{B, perch} [k] & = sat (k_{p, ψ, perch} {\dot{ψ}}_{B, perch}^{*} [k] + k_{i, ψ, perch} I_{{\dot{ψ}}_{B}^{*}} [k] + k_{d, ψ, perch} D_{{\dot{ψ}}_{B}^{*}} [k], \\ - k_{ψ, sat, perch}, k_{ψ, sat, perch}) . \end{matrix}

When the value of the perching counter perch_counter satisfies

perch_counter \geq k_{perch_counter}

, it transitions to the landing stage. The landing stage is the post-contact stage in which the drone remains seated on the horizontal bar. For generality, we write the target velocity as

v_{V, land} = {[0 0 - k_{z, land}]}^{T}

. An optional small downward bias was considered in the early design stage to press the vehicle slightly onto the bar. In the final implementation, however, we set

k_{z, land} = 0

, because once the vehicle is already supported by the bar, a persistent negative vertical-velocity command is not physically realizable and may adversely affect the internal PX4 velocity PID controller. Accordingly, in the implemented system, the landing stage applies zero translational velocity and relies on the PX4 attitude controller to maintain balance.

5. Results

This section is organized into three parts. We first evaluate the hurdle recognition model, then analyze end-to-end autonomous perching performance, and finally examine representative failure cases that illustrate practical limitations and motivate the main design choices.

5.1. Accuracy of Hurdle Recognition Model

Before evaluating the integrated perching system, we first analyze the detection accuracy of the enhanced YOLOv12n-based hurdle recognition model across several datasets and scenarios. This subsection establishes the perception capability on which the later control evaluation depends, and it has two parts. We first verify that the proposed architectural changes improve general detection quality on standard aerial-imaging benchmarks. We then examine hurdle-specific recognition performance on our custom dataset. To quantify the effect of each proposed model component, we report three training settings. First, in the COCO-to-VisDrone setting, each comparison model is pre-trained on COCO for 100 epochs, and the resulting COCO mAP50(M) is reported after pre-training. Although this is not the final task performance, it is still useful for comparing how structural changes affect general visual representation learning. Because VisDrone does not provide segmentation labels, the segmentation head is replaced with YOLOv12’s detection head, while all other weights are retained. These head-replaced, COCO-pre-trained models are then further trained on VisDrone for 200 epochs, and mAP50(M) and mAP50:95(M) are calculated. Second, in the COCO-to-hurdle setting, the COCO-pre-trained models are further trained on the hurdle dataset for 140 epochs, and performance is evaluated by calculating precision and recall separately for the hurdle (whole) and horizontal bar (part) classes. For these two training runs, we use a batch size of 16, the auto optimizer setting, and overlap_mask=false to prevent the hurdle and bar masks from being merged when they overlap spatially. All other hyperparameters follow the defaults of the yolo library.

Third, in the VisDrone-scratch setting, the models are trained on VisDrone from the beginning without COCO pre-training. In this third setting, the proposed model’s Segment head is replaced with YOLOv12’s Detection head and trained on the VisDrone dataset for 300 epochs. The input image size is

640 \times 640

, the batch size is 8, and the optimizer is SGD with learning rate 0.01 and momentum 0.937. The remaining hyperparameter settings follow the default parameters of the yolo library. In the model comparisons, we report computational cost (GFLOPs) alongside the accuracy metrics so that the performance-efficiency trade-off can be discussed directly.

Hurdles photographed from long distances appear as very small objects in the image, and YOLO-series segmentation models tend to overestimate their masks by including surrounding background [37]. In our application, however, the main purpose of the recognition model is to estimate the hurdle center and inclination for control. Stable center estimation is therefore more important than pixel-accurate mask boundaries. For this reason, performance evaluation on the hurdle dataset adopts the matching-tolerance concept from the BSDS500 boundary-detection benchmark [38]: IoU is calculated after expanding both the ground-truth and predicted masks by a fixed amount. Specifically, when the image diagonal length is

d = \sqrt{H^{2} + W^{2}}

, where H and W are the image height and width, the tolerance radius is defined as

δ = 0.0075 d .

At the input resolution (

640 \times 480

) of this experiment,

d = 800

, so

δ = 6

pixels. This

δ

is applied as a single dilation with an elliptical structuring element in the code implementation. For a binary mask

M \subset Ω

(

Ω

is the pixel grid) and structuring element

B_{δ}

, the dilation operation is denoted as

D_{δ} (M) = M \oplus B_{δ} .

Subsequently, the relaxed IoU is calculated as the IoU of the dilated masks as

{IoU}_{δ} (M_{gt}, M_{pred}) = \frac{|D_{δ} (M_{gt}) \cap D_{δ} (M_{pred})|}{|D_{δ} (M_{gt}) \cup D_{δ} (M_{pred})|} .

In this study, when

{IoU}_{δ} > 0.5

is satisfied, this constitutes “detection success”.
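The relaxed IoU can be illustrated with a short, dependency-free sketch. It is not the implementation used in the experiments: masks are represented as sets of (row, column) pixels, a discrete disk of radius δ stands in for the elliptical structuring element, and all function names are hypothetical.

```python
import math

def tolerance_radius(height, width, fraction=0.0075):
    """Tolerance radius delta = fraction * image diagonal, in pixels."""
    return fraction * math.hypot(height, width)

def disk_offsets(radius):
    """Pixel offsets of a discrete disk (assumed structuring element)."""
    r = int(round(radius))
    return [(dy, dx)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= r * r]

def dilate(mask, offsets, height, width):
    """Dilate a mask given as a set of (row, col) pixels, clipped to the image."""
    out = set()
    for y, x in mask:
        for dy, dx in offsets:
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out.add((ny, nx))
    return out

def iou(a, b):
    """Plain intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def relaxed_iou(gt, pred, height, width, fraction=0.0075):
    """IoU of the two masks after dilating both by the tolerance radius."""
    offsets = disk_offsets(tolerance_radius(height, width, fraction))
    return iou(dilate(gt, offsets, height, width),
               dilate(pred, offsets, height, width))

# Example: a one-pixel-thick bar predicted 3 pixels below its true position.
H, W = 480, 640
gt_bar = {(100, x) for x in range(100, 200)}
pred_bar = {(103, x) for x in range(100, 200)}

strict = iou(gt_bar, pred_bar)                  # no shared pixels
relaxed = relaxed_iou(gt_bar, pred_bar, H, W)   # overlap after dilation
detected = relaxed > 0.5
```

In this toy case the strict IoU is zero even though the predicted bar is only 3 pixels off, whereas the relaxed IoU exceeds 0.5. This is the behavior the tolerance is meant to provide for thin targets whose center, rather than exact boundary, matters for control.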

Additionally, when an image contains multiple hurdles or a hurdle is partly occluded, defining TP/FP/FN at the instance level from pixel-wise IoU alone becomes difficult. We therefore extract 8-connected components from the dilated masks and perform region-level matching. Each predicted region is paired with the not-yet-matched ground-truth region with which it has the maximum IoU, and if that IoU is above the threshold $\tau = 0.5$, the pair is counted as a TP. Unmatched predictions are counted as FP, and unmatched ground-truth regions as FN. The final precision and recall are calculated as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} .$$
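The region-level matching can be sketched as follows. This is a minimal illustration under stated assumptions: masks are sets of (row, column) pixels, predictions are processed in the order they are given, the dilation step is omitted for brevity, and all function names are hypothetical.

```python
from collections import deque

def connected_components(mask):
    """Split a mask (set of (row, col) pixels) into 8-connected regions."""
    remaining = set(mask)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = {seed}, deque([seed])
        while queue:
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    nb = (y + dy, x + dx)
                    if nb in remaining:
                        remaining.remove(nb)
                        region.add(nb)
                        queue.append(nb)
        regions.append(region)
    return regions

def region_iou(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_regions(pred_regions, gt_regions, tau=0.5):
    """Greedy matching of predicted to ground-truth regions; returns (TP, FP, FN)."""
    unmatched_gt = list(range(len(gt_regions)))
    tp = 0
    for pred in pred_regions:
        if not unmatched_gt:
            break
        # Best not-yet-matched ground-truth region for this prediction.
        best = max(unmatched_gt, key=lambda j: region_iou(pred, gt_regions[j]))
        if region_iou(pred, gt_regions[best]) > tau:
            tp += 1
            unmatched_gt.remove(best)
    return tp, len(pred_regions) - tp, len(gt_regions) - tp

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def block(top, left, size=4):
    """Square test region of size x size pixels."""
    return {(top + i, left + j) for i in range(size) for j in range(size)}

# Two ground-truth regions; one is found (shifted by a pixel), one is missed,
# and one spurious region is predicted elsewhere.
gt_mask = block(0, 0) | block(0, 10)
pred_mask = block(0, 1) | block(10, 20)

counts = match_regions(connected_components(pred_mask),
                       connected_components(gt_mask))
scores = precision_recall(*counts)
```

Here the shifted prediction overlaps its ground-truth region with an IoU of 0.6 and is counted as a TP, the spurious region is an FP, and the missed region is an FN, giving a precision and a recall of 0.5 each.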

In autonomous landing, a temporary missed detection of the hurdle can be resolved by the passage of time or by movement of the drone, whereas falsely detecting a structure that is not a hurdle drives the drone toward the wrong location. Consequently, this study gives precision the highest priority, followed by recall. If the center coordinates and inclination of the recognized objects are accurate, the exact shape of the segment matters relatively little for drone control, so mAP50 is used only as a reference measure.
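To make concrete why the center and inclination are less sensitive to boundary errors than the mask shape, the sketch below estimates both quantities from second-order image moments. This is a standard technique used here for illustration only; the post-processing actually used in the system may differ, and the function name is hypothetical.

```python
import math

def centroid_and_inclination(mask):
    """Centroid (cx, cy) and principal-axis angle of a non-empty pixel set.

    mask holds (row, col) pixels. The angle is in radians, measured from the
    image x axis toward the image y axis (which points down in image coordinates).
    """
    pixels = list(mask)
    n = len(pixels)
    cy = sum(y for y, _ in pixels) / n
    cx = sum(x for _, x in pixels) / n
    # Central second-order moments.
    mu20 = sum((x - cx) ** 2 for _, x in pixels) / n
    mu02 = sum((y - cy) ** 2 for y, _ in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for y, x in pixels) / n
    angle = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return cx, cy, angle

# A level bar, 3 pixels thick and 100 pixels long.
level_bar = {(y, x) for y in range(50, 53) for x in range(10, 110)}
cx, cy, angle = centroid_and_inclination(level_bar)

# A thin bar along the image diagonal (one pixel per column).
tilted_bar = {(i, i) for i in range(100)}
_, _, tilted_angle = centroid_and_inclination(tilted_bar)
```

For the level bar the sketch returns a centroid of (59.5, 51.0) and an angle of zero, and for the diagonal bar an angle of π/4. Because both quantities are averages over all mask pixels, a mask that is uniformly a few pixels too wide changes them very little.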

Table 1 reports the COCO pre-training results together with the results of the first setting, namely, additional training on VisDrone after COCO pre-training. The listed computational costs correspond to the final detection and segmentation configurations used for VisDrone and hurdle evaluation, respectively. To isolate the effect of each architectural change, we also include an intermediate model in which only the P2 path is added to the YOLOv12 baseline. Adding the P2 path improves both COCO performance after pre-training and VisDrone performance, suggesting benefits for both general feature learning and small-object detection. The subsequent lightweighting step recovers 0.6/0.4 GFLOPs (detection/segmentation) of the 1.0/0.8 GFLOPs increase introduced by the P2 path, while still providing additional performance gains on VisDrone. These benchmark-level results are important because they show that the proposed modifications are not merely tuned to the custom hurdle dataset but improve the model in a way that remains visible on broader aerial-imaging tasks, which makes the later hurdle-specific analysis more convincing.

Table 2 shows the performance of scratch training on VisDrone without COCO pre-training, and Figure 11 shows example recognition results for the three models. The proposed model outperforms both the baseline YOLOv12n and YOLOv8n. The scratch-training results thus lead to the same conclusion as the pre-trained comparison: the proposed architecture retains an accuracy advantage even without COCO initialization, which suggests that the gain arises from the architecture itself rather than from COCO initialization alone.

We next examine COCO pre-trained hurdle-dataset results in order to compare the practical effects of negative-sample training, whole-part joint learning, and the proposed architectural modifications under the target task itself. This comparison is especially relevant because these choices primarily affect the precision–recall trade-off that matters during real flight.

Table 3 summarizes the effect of model structure and training strategy on hurdle recognition. Training with negative samples causes a slight decrease in recall and mAP50, but it improves precision substantially, from 0.744 to 0.819 for hurdles and from 0.934 to 0.969 for horizontal bars. This result indicates that negative samples effectively suppress confusion with hurdle-like background structures. This precision gain is operationally important. In autonomous perching, a false positive can direct the vehicle toward a dangerous or irrelevant structure. By contrast, a temporary missed detection can often be recovered by continued motion or target reacquisition.

Compared with the single-class setting that detects only the horizontal bar, the joint setting that detects both the hurdle and the bar maintains bar recall at 0.792 while slightly improving precision from 0.962 to 0.969 and mAP50 from 0.775 to 0.783. This suggests that multi-class learning can exploit the structural context of the whole hurdle without degrading bar-recognition performance, which is consistent with the whole–part recognition strategy adopted in the controller design.

The proposed structural modifications increase computational load by approximately 5.8%. Even so, they improve mAP50 at the COCO pre-training stage by about 1 percentage point and improve hurdle precision from 0.819 to 0.884, a gain of 6.5 percentage points. These results suggest that the high-resolution P2 path contributes meaningfully to recognizing small hurdles at long range. Horizontal-bar performance degrades slightly, but this trade-off is acceptable for our system: the controller uses the whole hurdle during the long-range approach and switches to the horizontal bar only at close range, and most recognition failures occur during long-range hurdle detection. The gain in hurdle recognition is therefore more important than the small loss in bar performance. Overall, these results support the proposed model as the perception front end for the full autonomous perching pipeline.

The aggregate results above suggest that the main benefit of the proposed architecture should appear in the long-range regime, where hurdles occupy only a small portion of the image. To test this point directly, we split the hurdle-dataset samples into cases in which the hurdle occupies more than 1/9 of the image area and cases in which it occupies less than 1/100, corresponding approximately to close-range and long-range operation, respectively. Table 4 reports the results for the two subsets.
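The size-based split can be expressed as a small helper. The sketch assumes that the occupied area is measured as the pixel count of the hurdle mask, which the text does not specify (a bounding-box area would work the same way), and the function name and labels are hypothetical.

```python
def size_regime(hurdle_area_px, image_height, image_width):
    """Classify a sample by the fraction of the image the hurdle occupies."""
    ratio = hurdle_area_px / (image_height * image_width)
    if ratio > 1 / 9:
        return "large"         # approximately close-range operation
    if ratio < 1 / 100:
        return "small"         # approximately long-range operation
    return "intermediate"      # belongs to neither evaluated subset

# At 640 x 480 the image holds 307,200 pixels, so the two thresholds are
# about 34,133 pixels (1/9) and 3,072 pixels (1/100).
examples = [size_regime(a, 480, 640) for a in (40000, 10000, 3000)]
```

Samples between the two thresholds fall into neither subset, so the comparison contrasts clearly separated close-range and long-range cases.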

Table 4 confirms that the advantage of the proposed model appears mainly in the small-hurdle regime, whereas all models perform similarly when the hurdle is already large in the image. This is the more important regime for our application because long-range recognition strongly affects whether the subsequent approach can proceed stably.

For small hurdles, the proposed model improves precision from 0.645 to 0.767 and recall from 0.710 to 0.812 relative to YOLOv12n, while mAP50(M) increases from 0.685 to 0.774. YOLOv8n also improves small-hurdle performance relative to YOLOv12n, but it requires 8.2 GFLOPs, compared with 6.5 for YOLOv12n and 6.9 for the proposed model. The proposed model, therefore, provides the best balance between long-range hurdle recognition and onboard computational cost, which supports the benefit of the added high-resolution P2 feature level.

For small bars, the proposed structural modifications do not produce a clearly meaningful performance gain. This is not unexpected, because the simple shape of a small horizontal bar makes it easy to confuse with similar objects such as branches or drainpipes, and at long range, the loss of fine visual detail makes it blend more easily into the background. In practice, however, this limitation is acceptable for our system, because the controller mainly relies on whole-hurdle recognition during the long-range approach stage and switches to bar-based guidance only after the vehicle has moved sufficiently close to the target.

Figure 12 shows the hurdle recognition results of the proposed model. The inference results corresponding to the original images at the top are presented as mask images at the bottom, with the entire hurdle area visualized in red, the horizontal bar area in green, and the overlapping area of the two in yellow. The examples show that even when multiple hurdles exist within the field of view, each object is clearly separated and recognized. In the shown cases, structures visually similar to hurdles are not falsely detected, and hurdle components remain detectable even at long range or under partial occlusion. Taken together, these results show that the proposed hurdle-recognition model can reliably detect the whole hurdle at long range while suppressing false recognition of hurdle-like structures.

5.2. Accuracy of Autonomous Perching Algorithm

Having established the detection capability of the recognition model, we now evaluate the end-to-end performance of the complete autonomous perching system through real outdoor flight experiments. This subsection analyzes six outdoor flight experiments designed to evaluate both the overall performance of the proposed system and the contribution of its main components. All experiments use a scenario in which the drone takes off and attempts to land on a hurdle located approximately 10 m ahead. To avoid overfitting the evaluation to a single initial pose, the takeoff position is randomized within a radius of 1 m around a nominal point 10 m in front of the hurdle. This setup allows us to assess robustness under realistic initial-condition variation.
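The randomized takeoff position can be sketched as a draw from a disk around the nominal point. The text does not state the sampling distribution, so a uniform distribution over the disk is assumed here purely for illustration, and the function name is hypothetical.

```python
import math
import random

def sample_takeoff_offset(radius=1.0, rng=random):
    """Draw an (x, y) offset uniformly from a disk of the given radius (metres).

    The square root on the radial draw keeps the density uniform over the
    disk area instead of concentrating samples near the centre.
    """
    r = radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    return r * math.cos(phi), r * math.sin(phi)

# Offsets are added to the nominal takeoff point 10 m in front of the hurdle.
rng = random.Random(0)
offsets = [sample_takeoff_offset(1.0, rng) for _ in range(1000)]
max_distance = max(math.hypot(dx, dy) for dx, dy in offsets)
```

Every sampled offset stays within 1 m of the nominal point, so the initial range to the hurdle varies between roughly 9 m and 11 m.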

For performance evaluation, we distinguish between ‘perching’ and ‘landing’ in order to separate initial bar contact from subsequent stable post-contact behavior. This distinction is important because successful contact does not necessarily guarantee a successful landing; posture misalignment can still cause failure, and the sensors may also fail to detect that the drone is already seated on the hurdle, which can trigger a retry.

The autonomous perching algorithm uses the parameter set summarized in Table 5. The camera parameters $(f_{x}, f_{y}, c_{x}, c_{y})$, where $f_{x}$ and $f_{y}$ denote the focal lengths and $c_{x}$ and $c_{y}$ denote the principal point coordinates, use the default RealSense calibration values. The proposed system is designed for a standardized F450 frame, and the control parameters were determined through real-flight experiments.
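As an illustration of how the camera parameters enter the pipeline, the sketch below converts a pixel position, such as a mask centroid, into horizontal and vertical bearing angles under the standard pinhole model. The intrinsic values are placeholders chosen for the example, not the RealSense calibration values, and the function name is hypothetical.

```python
import math

def pixel_to_bearing(u, v, fx, fy, cx, cy):
    """Bearing angles (rad) of the ray through pixel (u, v) under a pinhole model.

    Returns (horizontal, vertical): positive horizontal is right of the
    optical axis, positive vertical is below it (image y points down).
    """
    x = (u - cx) / fx   # normalized image coordinates
    y = (v - cy) / fy
    return math.atan(x), math.atan(y)

# Placeholder intrinsics for a 640 x 480 image (assumed values).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

center = pixel_to_bearing(320.0, 240.0, FX, FY, CX, CY)
right_edge = pixel_to_bearing(640.0, 240.0, FX, FY, CX, CY)
```

A centroid at the principal point maps to zero bearing in both axes, and with these placeholder values the right image edge lies about 28° off the optical axis.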

Although the parameters in Table 5 are empirical, they are not arbitrary. The transition thresholds are tied to sensing validity and vehicle geometry. For example, $d_{start}$ and $\theta_{start}$ are selected so that the controller switches to perching only after the hurdle is sufficiently large in the image and the gimbal angle indicates that the vehicle is close enough for bar-centered alignment. The altitude-related thresholds are likewise set with consideration of the hurdle height, landing-gear clearance, and the rear-mounted LiDAR offset, which helps reduce the likelihood of incorrect decisions during the perching stage. The perching-stage gains were then fixed empirically through extensive parameter exploration in real-flight experiments so as to provide stable descent and alignment behavior during perching.

Performance is evaluated by dividing the flight process into three stages: ‘reaching’, ‘perching’, and ‘landing’. For each stage, we record success or failure and the corresponding time distribution. Here, ‘perching’ means that the rectangular area formed by the four landing-gear endpoints is in physical contact with the hurdle bar, but stable landing is not yet guaranteed. ‘Landing’ begins at the earliest time point at which, after the drone has descended onto the bar, the bottom surface of the drone maintains continuous contact with the bar. Therefore, inadequate posture alignment during perching can still lead to landing failure and a subsequent re-approach.

Table 6 summarizes the experimental results. Stage durations are analyzed only for successful trials. A low perching success rate relative to the reaching success rate indicates tracking failure during descent. In most cases, this failure is caused by rapid changes in the hurdle’s gaze angular velocity. A low landing success rate relative to the perching success rate indicates failure after contact. This usually means that the drone reached the bar with an unsuitable posture. The full system, which combines the proposed hurdle recognition model with the complete control algorithm, achieves the highest landing success rate. By contrast, YOLOv12n without negative-sample training sometimes drives the drone away from the correct path and increases total completion time. These results show that both model design and dataset design contribute directly to practical perching performance. For the full configuration, the landing success rate is 6/6 (exact 95% binomial CI: 54.1–100%). The most important outcome of Table 6 is that the proposed hurdle recognition model is best not only in perception metrics but also in end-to-end flight success. This is critical because the purpose of the model is not standalone recognition. Its purpose is to support a stable and safe control loop during the full maneuver.
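The exact binomial interval quoted for the 6/6 result can be reproduced with a short sketch of the Clopper–Pearson construction. Each bound is found by bisection on a binomial tail probability, using only the standard library; the function names are hypothetical.

```python
import math

def binom_pmf(i, n, p):
    """Probability of exactly i successes in n Bernoulli(p) trials."""
    return math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)

def _bisect(f, target, increasing, iters=100):
    """Solve f(p) = target for p in [0, 1], where f is monotonic in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    def upper_tail(p):   # P(X >= k), increasing in p
        return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

    def lower_tail(p):   # P(X <= k), decreasing in p
        return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

    lower = 0.0 if k == 0 else _bisect(upper_tail, alpha / 2, increasing=True)
    upper = 1.0 if k == n else _bisect(lower_tail, alpha / 2, increasing=False)
    return lower, upper

# Full configuration: 6 successful landings out of 6 trials.
# For k == n the lower bound has the closed form (alpha / 2) ** (1 / n).
low, high = clopper_pearson(6, 6)
```

For 6/6 the lower bound equals 0.025^(1/6), about 0.541, and the upper bound is 1, matching the 54.1–100% interval quoted above. The width of this interval shows how little six trials constrain the true success rate.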

We also performed an ablation study to isolate the contribution of each major control component. First, even at low close-range speed, the gimbal can lose the target because the hurdle’s gaze angular velocity changes rapidly during descent. When gimbal sweeping is removed, the system cannot reacquire the hurdle quickly. This reduces landing success and increases completion time. Second, removing the y-axis and yaw-alignment process accelerates descent, but it also increases landing failures and retries. In that setting, the drone reaches the bar with a less stable posture. This effect appears as larger time variance and lower success rates. Third, when the perching trajectory is simplified to $f(t) = 0$, rear-looking views occur more frequently. In these views, the camera can observe drone structures such as the landing gear, which increases misrecognition and can drive the drone toward incorrect locations during descent. As a result, the landing success rate drops sharply. These ablation results clarify that reliable perching requires the perception model, the target-reacquisition mechanism, and the trajectory and alignment logic to work together. Accordingly, the proposed system is evaluated as an integrated perception–control framework rather than as a detector and controller considered separately.
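To illustrate the role of gimbal sweeping in target reacquisition, the sketch below commands a triangle-wave sweep whenever the target is not visible. The sweep pattern, angle limits, and period are assumptions made for this example and are not taken from the system described here; the function names are hypothetical.

```python
def sweep_angle(t, angle_min, angle_max, period):
    """Triangle-wave angle: angle_min -> angle_max -> angle_min over one period."""
    phase = (t % period) / period                               # in [0, 1)
    tri = 2.0 * phase if phase < 0.5 else 2.0 * (1.0 - phase)   # 0 -> 1 -> 0
    return angle_min + (angle_max - angle_min) * tri

def gimbal_command(target_angle, t_since_loss,
                   angle_min=0.0, angle_max=90.0, period=4.0):
    """Track the target while it is visible, otherwise sweep to reacquire it.

    target_angle is None when the detector reports no target; angles are in
    degrees and t_since_loss is the time in seconds since the target was lost.
    """
    if target_angle is not None:
        return target_angle
    return sweep_angle(t_since_loss, angle_min, angle_max, period)

# Target visible: hold the tracking angle. Target lost: follow the sweep.
tracking = gimbal_command(30.0, 0.0)
sweeping = [gimbal_command(None, t) for t in (0.0, 1.0, 2.0, 3.0)]
```

With the assumed limits and period, the lost-target commands pass through 0°, 45°, 90°, and 45°, so the camera covers its whole pitch range within a few seconds. A fixed gimbal, by contrast, can only wait for the target to re-enter the field of view.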

Figure 13 presents a chronophotographic view of the first full-configuration experiment. It shows the drone’s position and attitude changes throughout the maneuver. During the reaching stage, the drone moves quickly toward the hurdle. During the perching stage, it follows the proposed forward-then-backward descent trajectory. A small overshoot caused by inertia is visible immediately after the drone reaches the area above the hurdle. This qualitative result is consistent with the quantitative timing and success-rate data above.

Figure 14a shows an enlarged chronophotographic view of the drone near the hurdle during one of the full-configuration experiments. Figure 14b shows the 10 s interval immediately after the drone establishes stable contact with the horizontal bar. The visualization overlays eight frames extracted from a 10 s video. It shows that the proposed system can maintain posture on the narrow bar using visual feedback alone, without a separate physical fixing device such as a gripper.

Figure 15 shows the execution period of each algorithm measured during 10 s of autonomous perching operation. The recognition module, including hurdle detection and pose estimation, operates with an average period of 81.54 ms, while the autonomous perching control algorithm operates with an average period of 33.32 ms. These results confirm that the full perception-control loop runs fast enough for onboard deployment in the tested setting.
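The reported periods translate directly into update rates. The sketch below shows the conversion and a simple way to compute a mean period from logged timestamps; the function names are hypothetical and the timestamp list is a made-up example rather than measured data.

```python
def mean_period_ms(timestamps_s):
    """Mean interval between consecutive execution timestamps, in milliseconds."""
    if len(timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps_s[-1] - timestamps_s[0]
    return 1000.0 * span / (len(timestamps_s) - 1)

def rate_hz(period_ms):
    """Update rate corresponding to a period given in milliseconds."""
    return 1000.0 / period_ms

perception_hz = rate_hz(81.54)   # recognition module, roughly 12.3 Hz
control_hz = rate_hz(33.32)      # perching controller, roughly 30.0 Hz

# Made-up timestamps (seconds) showing the helper on a 10 Hz loop.
example_period = mean_period_ms([0.0, 0.1, 0.2, 0.3])
```

The controller therefore runs at roughly 30 Hz while perception updates arrive at roughly 12 Hz, so the controller executes about two to three cycles per perception update.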

5.3. Failure Cases

Although the proposed system achieved high success rates in the main experiments, quantitative success rates alone do not fully explain the role of each design choice. For that reason, this subsection presents two representative failure scenarios observed during the ablation study. These cases complement the numerical results above and clarify why false-positive suppression and gimbal sweeping are essential in practice.

During ablation experiments with the standard YOLOv12 model and without negative-sample training, the recognition system occasionally misidentified background structures, especially trees and poles, as hurdle targets. Figure 16 shows one such case. After losing sight of the real hurdle, the drone abruptly changed direction toward a background tree. This behavior highlights the importance of suppressing false positives through negative-sample training. For safety, the drone was forced to land manually just before impact. This failure mode also explains why precision is prioritized over recall in our evaluation. A temporary missed detection can often be recovered through continued flight or gimbal sweeping. A confident false detection, in contrast, can drive the drone toward a dangerous target.

The ablation study without gimbal sweeping revealed a clear limitation in target reacquisition. Figure 17 shows a drone wandering above the hurdle after losing visual contact. Near the hurdle, the target can move rapidly in the camera image because its apparent angular velocity becomes large. Without gimbal sweeping, the system cannot quickly recover a target that leaves the field of view. This problem is especially severe when the drone is already close to the hurdle, because small vehicle motions then produce large image-space shifts. In this failure mode, the drone remained above the hurdle area without completing the perching sequence. The gimbal sweeping mechanism is therefore essential for recovering from temporary tracking loss and continuing the maneuver.

Taken together with the earlier accuracy, flight-test, and timing results, these failure cases complete the system-level argument of the Results section. The proposed framework is not only accurate and effective but also executable in real time on the onboard computing platform used in the experiments, while the ablations clarify the practical consequences of removing key design elements.

6. Conclusions

This paper presented an onboard perching system that enables a quadrotor UAV to perch autonomously on the horizontal bar of a markerless hurdle in an uncooperative outdoor environment. The proposed system integrates perception, post-processing, and control within a single framework that runs in real time on resource-constrained onboard hardware.

The experimental results support the effectiveness of this integrated design. In the small-hurdle regime, the proposed perception model improves hurdle detection precision and recall by 12.2 and 10.2 percentage points, respectively, relative to the YOLOv12n baseline. In outdoor flight tests, the full system achieves 6/6 landing success. The perception module runs with an average period of 81.54 ms on a Jetson Orin NX. The ablation studies further confirm the roles of negative samples, the P2 path, gimbal sweeping, the perching trajectory, and alignment control. Taken together, these results show that a lightweight integrated perception–control framework can enable practical markerless hurdle perching on onboard hardware and remain feasible in real outdoor flight on thin 3D structures under realistic sensing constraints.

Despite these results, several limitations remain. First, many control parameters still depend on empirical tuning for the F450-based platform. This may limit reproducibility across different platforms, payloads, sensor configurations, and vehicle dimensions. Because the proposed controller sends high-level velocity commands through PX4, moderate changes in total vehicle mass and small variations in overall vehicle dimensions are expected to have limited influence on the upper-level control logic when the underlying PX4 velocity controller is well tuned. However, geometric quantities that enter the control law explicitly, such as the distance between the gimbal camera and the drone center, may be more sensitive to platform changes. Wind and temperature effects may also be partly absorbed by the lower-level PX4 controller, but the quantitative sensitivity of the full system to these factors has not yet been analyzed. Second, the current experiments do not yet cover a sufficiently broad range of illumination changes, wind disturbances, cluttered backgrounds, obstacle conditions, and dynamic targets for a full robustness and generalization assessment. In particular, performance on moving targets, including slowly moving ones, has not been directly validated. Third, both the hurdle dataset and the experimental scale remain limited relative to what would be needed for stronger statistical and domain-generalization claims. In particular, six flight tests per condition are not sufficient for definitive statistical validation, so the manuscript does not yet provide sufficient evidence of robustness and generalization to support stronger deployment-oriented claims.

Future work should therefore focus on three directions. First, manual parameter dependence should be reduced through systematic sensitivity analysis and by quantifying how performance changes with vehicle weight, payload, and geometric variations. Second, flight validation should be expanded with more trials per condition and with more diverse environments, including varying illumination, wind disturbances, complex visual backgrounds, obstacle conditions, and slowly moving targets. Third, the dataset should be enlarged to include more diverse structures and appearances so that model generalization can be assessed more rigorously. Extensions to more complex 3D structures, non-rigid targets, and multi-drone perching scenarios are also promising directions.

Overall, the proposed system shows that markerless outdoor perching on a thin hurdle bar is feasible with a lightweight onboard perception–control pipeline. The methods and experimental findings provide a practical basis for future UAV perching systems intended for long-endurance missions such as infrastructure inspection, surveillance, and autonomous charging or resting operations.

Author Contributions

Conceptualization, D.K.; methodology, D.K.; software, D.K.; validation, D.K.; formal analysis, D.K.; investigation, D.K.; resources, D.K.; data curation, D.K.; writing—original draft preparation, D.K.; writing—review and editing, D.K. and D.E.C.; visualization, D.K.; supervision, D.E.C.; project administration, D.E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00469/RS-2022-II220469, Development of Core Technologies for Task-oriented Reinforcement Learning for Commercialization of Autonomous Drones).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
YOLO	You Only Look Once
COCO	Common Objects in Context
mAP	mean Average Precision
IoU	Intersection over Union
PID	Proportional-Integral-Derivative
LiDAR	Light Detection and Ranging
ToF	Time of Flight
IMU	Inertial Measurement Unit
FPS	Frames Per Second
GNSS	Global Navigation Satellite System
RGB	Red Green Blue

References

1. Meng, J.; Buzzatto, J.; Liu, Y.; Liarokapis, M. On aerial robots with grasping and perching capabilities: A comprehensive review. Front. Robot. AI 2022, 8, 739173.
2. Xin, L.; Tang, Z.; Gai, W.; Liu, H. Vision-based autonomous landing for the uav: A review. Aerospace 2022, 9, 634.
3. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
4. Kong, Y.; Shang, X.; Jia, S. Drone-DETR: Efficient small object detection for remote sensing image using enhanced RT-DETR model. Sensors 2024, 24, 5496.
5. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
6. Askari, M.; Benciolini, M.; Phan, H.V.; Stewart, W.; Ijspeert, A.J.; Floreano, D. Crash-perching on vertical poles with a hugging-wing robot. Commun. Eng. 2024, 3, 98.
7. Hsiao, Y.H.; Bai, S.; Zhou, Y.; Jia, H.; Ding, R.; Chen, Y.; Wang, Z.; Chirarattananon, P. Energy efficient perching and takeoff of a miniature rotorcraft. Commun. Eng. 2023, 2, 38.
8. Lussier Desbiens, A.; Cutkosky, M.R. Landing and perching on vertical surfaces with microspines for small unmanned air vehicles. J. Intell. Robot. Syst. 2010, 57, 313–327.
9. Prachand, C.; Rustagi, R.; Shankar, R.; Singh, J.; Abhishek, A.; Venkatesh, K. Vision-Based Autonomous Ship Deck Landing of an Unmanned Aerial Vehicle Using Fractal ArUco Markers. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 2345.
10. Nguyen, P.T.; Westerlund, T.; Peña Queralta, J. Vision-based safe autonomous UAV docking with panoramic sensors. Front. Robot. AI 2023, 10, 1223157.
11. Liu, X.; Xue, W.; Xu, X.; Zhao, M.; Qin, B. Research on unmanned aerial vehicle (UAV) visual landing guidance and positioning algorithms. Drones 2024, 8, 257.
12. Wu, H.; Wang, W.; Wang, T.; Suzuki, S. High-Precision Landing on a Moving Platform Based on Drone Vision Using YOLO Algorithm. Drones 2025, 9, 261.
13. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40.
14. He, G.; Jangir, Y.; Geng, J.; Mousaei, M.; Bai, D.; Scherer, S. Image-based visual servo control for aerial manipulation using a fully-actuated UAV. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 5042–5049.
15. Chen, C.W.; Hung, H.A.; Yang, P.H.; Cheng, T.H. Visual servoing of a moving target by an unmanned aerial vehicle. Sensors 2021, 21, 5708.
16. Wu, W.; Liu, A.; Hu, J.; Mo, Y.; Xiang, S.; Duan, P.; Liang, Q. EUAVDet: An efficient and lightweight object detector for UAV aerial images with an edge-based computing platform. Drones 2024, 8, 261.
17. Xiao, Y.; Di, N. SOD-YOLO: A lightweight small object detection framework. Sci. Rep. 2024, 14, 25624.
18. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 27, 200561.
19. Zhou, J.; Su, T.; Li, K.; Dai, J. Small target-YOLOv5: Enhancing the algorithm for small object detection in drone aerial imagery based on YOLOv5. Sensors 2023, 24, 134.
20. Wang, J.; Zhang, F.; Zhang, Y.; Liu, Y.; Cheng, T. Lightweight object detection algorithm for uav aerial imagery. Sensors 2023, 23, 5786.
21. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.
22. Shi, H.; Yang, W.; Chen, D.; Wang, M. ASG-YOLOv5: Improved YOLOv5 unmanned aerial vehicle remote sensing aerial images scenario for small object detection based on attention and spatial gating. PLoS ONE 2024, 19, e0298698.
23. Chen, X.; Mottaghi, R.; Liu, X.; Fidler, S.; Urtasun, R.; Yuille, A. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1971–1978.
24. Patruno, C.; Nitti, M.; Petitti, A.; Stella, E.; D’Orazio, T. A vision-based approach for unmanned aerial vehicle landing. J. Intell. Robot. Syst. 2019, 95, 645–664.
25. Lin, S.; Jin, L.; Chen, Z. Real-time monocular vision system for UAV autonomous landing in outdoor low-illumination environments. Sensors 2021, 21, 6226.
26. Abdelaziz, N.; El-Rabbany, A. INS/LIDAR/Stereo SLAM integration for precision navigation in GNSS-denied environments. Sensors 2023, 23, 7424.
27. Hao, Y.; He, M.; Liu, Y.; Liu, J.; Meng, Z. Range–visual–inertial odometry with coarse-to-fine image registration fusion for uav localization. Drones 2023, 7, 540.
28. Irfan, M.; Dalai, S.; Trslic, P.; Riordan, J.; Dooly, G. LSAF-LSTM-based self-adaptive multi-sensor fusion for robust UAV state estimation in challenging environments. Machines 2025, 13, 130.
29. Zhang, X.; He, Z.; Ma, Z.; Jun, P.; Yang, K. VIAE-Net: An end-to-end altitude estimation through monocular vision and inertial feature fusion neural networks for UAV autonomous landing. Sensors 2021, 21, 6302.
30. Vazquez-Meza, V.E.; Martinez-Carranza, J. Landing Zone Detection for MAVs using Depth Images and Vision Transformers. In Proceedings of the IMAV2024-19. International Micro Air Vehicle Conference and Competition (IMAV); IMAV Association: Strasbourg, France, 2024; pp. 163–170.
31. Petrlík, M.; Krajník, T.; Saska, M. LiDAR-based stabilization, navigation and localization for uavs operating in dark indoor environments. In Proceedings of the 2021 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 15–18 June 2021; pp. 243–251.
32. Pritzl, V.; Vrba, M.; Tortorici, C.; Ashour, R.; Saska, M. Adaptive estimation of UAV altitude in complex indoor environments using degraded and time-delayed measurements with time-varying uncertainties. Robot. Auton. Syst. 2023, 160, 104315.
33. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
34. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-based lightweight yolo network for UAV small object detection. Remote Sens. 2023, 15, 4932.
35. Mourikis, A.I.; Roumeliotis, S.I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation; IEEE: New York, NY, USA, 2007; pp. 3565–3572.
36. Koubâa, A.; Allouch, A.; Alajlan, M.; Javed, Y.; Belghith, A.; Khalgui, M. Micro air vehicle link (mavlink) in a nutshell: A survey. IEEE Access 2019, 7, 87658–87680.
37. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389.
38. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 898–916.

Figure 1. Structure of hurdle recognition model.

Figure 2. Hurdle specifications.

Figure 3. Output of inference code.

Figure 4. Hardware configuration diagram of drone system.

Figure 5. F450-specification quadcopter used in the experiments.

Figure 6. Drone trajectory (cyan) showing the rapid ascent and descent caused by the LiDAR beam hitting the hurdle structure. This demonstrates the importance of rear-mounted LiDAR placement and a forward-approach trajectory design for avoiding false altitude readings.

Figure 7. State diagram of autonomous perching algorithm.

Figure 8. Conceptual diagram of gimbal position projected on camera plane.

Figure 9. Flowchart of perching stage.

Figure 10. Target trajectory of perching stage.

Figure 11. VisDrone recognition results of models.

Figure 12. Hurdle recognition results of model.

Figure 13. Landing trajectory of full configuration.

Figure 14. Behavior around hurdle.

Figure 15. Execution period of autonomous perching algorithm and hurdle recognition model.

Figure 16. Chronophotography of the drone trajectory (cyan) during a false positive detection in the YOLOv12 ablation study. The model misidentified background trees as hurdles, causing the drone to change direction toward unintended targets. A manual emergency landing was performed just before collision with the tree to ensure safety.

Figure 17. Chronophotography of the drone's wandering behavior above the hurdle when gimbal sweeping was disabled. Without sweeping, the drone failed to reacquire the target after losing visual contact, resulting in prolonged hovering above the hurdle area. The cyan curve indicates the drone's approach trajectory from right to left toward the hurdle.

Table 1. Comparison of model performance on COCO and VisDrone.

Model	Comp. Load (GFLOPs)	COCO mAP50	VisDrone (Detection) mAP50-Val	VisDrone (Detection) mAP50-95-Val
YOLOv8n	8.2/12.1	0.447	0.331	0.194
YOLOv12n	6.5/10.4	0.480	0.326	0.190
YOLOv12n + P2 added	7.5/11.2	0.496	0.341	0.202
Proposed model	6.9/10.8	0.490	0.347	0.206

Table 2. Model performance on VisDrone from scratch.

Model	Comp. (GFLOPs)	VisDrone (Detection) mAP50-Val	VisDrone (Detection) mAP50-95-Val	VisDrone (Detection) mAP50-Test	VisDrone (Detection) mAP50-95-Test
YOLOv8n	8.2	0.322	0.187	0.249	0.140
YOLOv12n	6.5	0.316	0.184	0.252	0.140
Proposed model	6.9	0.328	0.192	0.263	0.150
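
To make the margins in Table 2 easier to read, the relative change of the proposed model over YOLOv12n can be computed directly from the tabulated values. The following is a minimal sketch that only performs arithmetic on the table entries; the dictionary layout and the function name `relative_change` are illustrative and are not taken from the paper.

```python
# Values copied from Table 2 (VisDrone, trained from scratch).
TABLE2 = {
    "YOLOv8n": {"gflops": 8.2, "map50_val": 0.322, "map50_95_val": 0.187,
                "map50_test": 0.249, "map50_95_test": 0.140},
    "YOLOv12n": {"gflops": 6.5, "map50_val": 0.316, "map50_95_val": 0.184,
                 "map50_test": 0.252, "map50_95_test": 0.140},
    "Proposed model": {"gflops": 6.9, "map50_val": 0.328, "map50_95_val": 0.192,
                       "map50_test": 0.263, "map50_95_test": 0.150},
}


def relative_change(new, base):
    """Relative change of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Compare the proposed model against the YOLOv12n baseline, metric by metric.
for metric in ("gflops", "map50_val", "map50_95_val", "map50_test", "map50_95_test"):
    gain = relative_change(TABLE2["Proposed model"][metric], TABLE2["YOLOv12n"][metric])
    print(f"{metric}: {gain:+.1f}%")
```

Read this way, the table shows the proposed model gaining roughly 4% in mAP50 on both the validation and test splits for roughly 6% more GFLOPs than YOLOv12n, while still requiring fewer GFLOPs than YOLOv8n.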

Table 3. Hurdle and horizontal bar recognition performance of models.

Model	Hurdle Prec	Hurdle Rec	Hurdle mAP50(M)	Bar Prec	Bar Rec	Bar mAP50(M)
YOLOv8n (seg), bar + hurdle, with negative samples	0.838	0.867	0.831	0.976	0.774	0.766
YOLOv12n (seg), bar only, with negative samples	–	–	–	0.962	0.792	0.775
YOLOv12n (seg), bar + hurdle, without negative samples	0.744	0.890	0.879	0.934	0.799	0.794
YOLOv12n (seg), bar + hurdle, with negative samples	0.819	0.861	0.849	0.969	0.792	0.783
YOLOv12n + P2 added (seg), bar + hurdle, with negative samples	0.821	0.821	0.809	0.916	0.755	0.730
Proposed model (seg), bar + hurdle, with negative samples	0.884	0.884	0.870	0.962	0.786	0.775

Table 4. Hurdle recognition performance by object size.

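Model	Hurdle (Size > 1/9 Area) Prec	Hurdle (Size > 1/9 Area) Rec	Hurdle (Size > 1/9 Area) mAP50(M)	Hurdle (Size < 1/100 Area) Prec	Hurdle (Size < 1/100 Area) Rec	Hurdle (Size < 1/100 Area) mAP50(M)	Bar (Size > 1/9 Area) Prec	Bar (Size > 1/9 Area) Rec	Bar (Size > 1/9 Area) mAP50(M)	Bar (Size < 1/100 Area) Prec	Bar (Size < 1/100 Area) Rec	Bar (Size < 1/100 Area) mAP50(M)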
YOLOv8n	0.857	0.982	0.922	0.750	0.739	0.734	0.957	0.978	0.961	0.972	0.530	0.530
YOLOv12n	0.964	0.982	0.978	0.645	0.710	0.685	0.979	1.000	0.992	0.944	0.515	0.513
Proposed model	1.000	0.945	0.945	0.767	0.812	0.774	1.000	0.978	0.978	0.902	0.561	0.541

Table 5. Parameter list of perching algorithm.

Parameter	Value	Description	Parameter	Value	Description
W	0.3 (m)	Distance between camera and drone center	$d_{lidar, offset}$	0.05 (m)	Landing failure height reference offset
h	0.85 (m)	Height of hurdle horizontal bar	$d_{dist, land}$	150 (mm)	Distance criterion for landing on bar
$Δ t$	1/30 (s)	PID control period	$k_{x, land, perch}$	0.005 (m/s)	Target velocity when landed on bar
$k_{update}$	150	Number of frames to invalidate recognition result	$k_{z, land, perch}$	0.05 (m/s)
$k_{ϕ_{y}, sat}$	1	Gimbal control PID controller parameters	$h_{occlusion}$	$h - 0.15$ (m)	LiDAR occlusion threshold
$k_{i, cam, sat}$	50		$k_{occlusion, perch}$	0.05 (m/s)	Velocity during LiDAR occlusion
$k_{d, cam, sat}$	0.6		$k_{z, perch_failed_flag, perch}$	0.03 (m/s)
$k_{p, cam}$	0.2		$k_{x, perch}$	0.1 (/rad s)	Perching stage x control parameter
$k_{i, cam}$	0.1		$k_{z 1, perch}$	0.05 (/s)	Perching stage z control parameters
$k_{d, cam}$	0.25		$k_{z 2, perch}$	0.05 (m/s)
$θ_{search, sweep, \min}$	$- 0.2$ (rad)	Gimbal sweeping related parameters	$k_{z 3, perch}$	0.1 (/s)
$θ_{search, sweep, \max}$	2.0 (rad)		$k_{z 4, perch}$	0 (m/s)
$θ_{search, sweep}$	0.004 (rad)		$k_{z 5, perch}$	0.01 (m/s)
$θ_{perch, sweep, \min}$	0.2 (rad)		$h_{hover}$	1.2 (m)
$θ_{perch, sweep, \max}$	2.5 (rad)		$γ_{max}$	0.35 (rad)
$θ_{perch, sweep}$	0.003 (rad)		$k_{y p b, perch}$	0.3
$θ_{start}$	1.22 (rad)	Gimbal angle threshold for approach→perching	$d_{dist, offset}$	$0.15$ (m) − h
$d_{start}$	2 (m)	Distance threshold for approach→perching	$z_{lidar, \min}$	$h + 0.35$ (m)
$k_{sat, prepare_reach}$	0.1 (m/s)	Prepare reaching PID controller parameters	$k_{θ 1}$	0.25 (rad)
$h_{prepare_reach}$	2.0 (m)		$k_{θ 2}$	0.15 (rad)
$K_{p, prepare_reach}$	0.3		$z_{lidar, counter, \min}$	h	Landing counter related parameters
$h_{reach}$	1.5 (m)	Altitude threshold for prepare reaching→reaching	$z_{lidar, counter, \max}$	$h + 0.2$ (m)
$h_{search}$	1.2 (m)	Altitude threshold for prepare reaching→searching	$d_{dist, perchcount}$	0.1 (m)
$v_{\max, reach}$	0.3 (m/s)	Reaching velocity	$k_{i, sat, perch}$	${[0.3 0.7 0.5]}^{T}$	Perching PID controller parameters
$k_{x, reach}$	1	Reaching PID controller parameters	$K_{p, perch}$	$[\begin{matrix} 0.42 & 0 & 0 \\ 0 & 0.14 & 0 \\ 0 & 0 & 0.7 \end{matrix}]$
$k_{y, reach}$	1		$K_{i, perch}$	$[\begin{matrix} 0.014 & 0 & 0 \\ 0 & 0.007 & 0 \\ 0 & 0 & 0.049 \end{matrix}]$
$k_{ψ, reach}$	1		$K_{d, perch}$	$[\begin{matrix} 0.07 & 0 & 0 \\ 0 & 0.035 & 0 \\ 0 & 0 & 0.042 \end{matrix}]$
$ω_{\max, reach}$	0.35 (rad/s)		$k_{sat, perch}$	0.16
$d_{dist, sweep}$	190 (mm)	Distance sensor threshold for gimbal sweeping	$k_{ψ, i, sat, perch}$	0.7
$k_{sweep, perch}$	0.015 (m/s)	Target velocity during perching gimbal sweeping	$k_{p, ψ, perch}$	0.05
$z_{lidar, perch_failed}$	1.2 (m)	Target altitude on landing failure	$k_{i, ψ, perch}$	0.04
$k_{x, perch_failed, perch}$	0.005 (m/s)	Target velocity on landing failure	$k_{d, ψ, perch}$	0.005
$k_{z, perch_failed, perch}$	0.015 (m/s)		$k_{ψ, sat, perch}$	0.15
$k_{perch_counter}$	50	Landing-counter threshold for transition to landing	$k_{z, land}$	0	z target velocity during land
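
As an illustration of how the perching PID gains in Table 5 might be used, the sketch below applies the diagonal entries of $K_{p, perch}$, $K_{i, perch}$ and $K_{d, perch}$ in a textbook discrete PID loop running at the tabulated period $Δ t$ = 1/30 s. This is not the authors' controller: the class `AxisPID`, the reading of $k_{i, sat, perch}$ as a bound on the integral term, the reading of $k_{sat, perch}$ as a bound on each output component, and the example error values are all assumptions made for this example.

```python
DT = 1.0 / 30.0  # PID control period (Delta t) from Table 5, in seconds

# Diagonal entries of K_{p,perch}, K_{i,perch}, K_{d,perch} from Table 5.
KP = (0.42, 0.14, 0.7)
KI = (0.014, 0.007, 0.049)
KD = (0.07, 0.035, 0.042)
I_SAT = (0.3, 0.7, 0.5)  # k_{i,sat,perch}; ASSUMED to bound the integral term
OUT_SAT = 0.16           # k_{sat,perch}; ASSUMED to bound each output component


def clamp(value, limit):
    """Limit value to the symmetric interval [-limit, limit]."""
    return max(-limit, min(limit, value))


class AxisPID:
    """Discrete PID for one axis with integral and output saturation."""

    def __init__(self, kp, ki, kd, i_sat, out_sat, dt=DT):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.i_sat, self.out_sat, self.dt = i_sat, out_sat, dt
        self.integral = 0.0     # accumulated integral term, ki * sum(e) * dt
        self.prev_error = None  # no derivative action on the first sample

    def step(self, error):
        # Accumulate and saturate the integral term (anti-windup by clamping).
        self.integral = clamp(self.integral + self.ki * error * self.dt, self.i_sat)
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        command = self.kp * error + self.integral + self.kd * derivative
        return clamp(command, self.out_sat)


# One controller per axis, built from the diagonal gain entries.
controllers = [AxisPID(*gains, OUT_SAT) for gains in zip(KP, KI, KD, I_SAT)]
errors = (0.1, -0.05, 0.2)  # hypothetical per-axis errors, for illustration only
commands = [pid.step(e) for pid, e in zip(controllers, errors)]
print(commands)
```

Because the gain matrices in Table 5 are diagonal, the three axes decouple and can be handled by independent scalar controllers, which is the design choice made in this sketch.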

Table 6. Performance comparison by components of autonomous perching algorithm.

Condition	Reaching success ($n / N$)	Perching success ($n / N$)	Landing success ($n / N$)	Reaching time $μ \pm σ [\min, \max]$ (s)	Perching time $μ \pm σ [\min, \max]$ (s)	Landing time $μ \pm σ [\min, \max]$ (s)
Proposed model (seg), full configuration	6/6	6/6	6/6	$47.8 \pm 5.6 [41, 58]$	$87.7 \pm 19.3 [66, 123]$	$119.8 \pm 29.8 [85, 162]$
YOLOv12n (seg), model without negative samples	5/6	5/6	4/6	$50.0 \pm 3.0 [45, 53]$	$88.6 \pm 22.5 [61, 122]$	$145.3 \pm 51.3 [103, 233]$
Proposed model (seg), gimbal sweeping removed	6/6	6/6	3/6	$51.7 \pm 7.5 [38, 59]$	$136.8 \pm 59.7 [80, 264]$	$156.7 \pm 22.6 [133, 178]$
Proposed model (seg), using $f (t) = 0$ trajectory	6/6	6/6	1/6	$52.7 \pm 3.9 [47, 59]$	$94.8 \pm 31.4 [68, 142]$	$68.0 \pm 0.0 [68, 68]$
Proposed model (seg), no alignment during perching	6/6	5/6	5/6	$50.7 \pm 4.6 [44, 58]$	$75.8 \pm 4.7 [69, 82]$	$116.2 \pm 65.2 [74, 244]$
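
The success counts in Table 6 can be turned into landing success rates and compared against the full configuration with a few lines of code. The sketch below only restates the tabulated counts; the shortened condition labels and the helper `landing_rate` are illustrative names, not identifiers from the paper.

```python
# Success counts copied from Table 6 as (reaching, perching, landing), out of N = 6.
N_TRIALS = 6
TABLE6_SUCCESS = {
    "full configuration": (6, 6, 6),
    "without negative samples": (5, 5, 4),
    "gimbal sweeping removed": (6, 6, 3),
    "f(t) = 0 trajectory": (6, 6, 1),
    "no alignment during perching": (6, 5, 5),
}


def landing_rate(condition):
    """Fraction of the N trials that reached a successful landing."""
    return TABLE6_SUCCESS[condition][2] / N_TRIALS


baseline = landing_rate("full configuration")
for condition in TABLE6_SUCCESS:
    drop = baseline - landing_rate(condition)
    print(f"{condition}: landing {landing_rate(condition):.0%}, drop {drop:.0%}")
```

With only six trials per condition these rates are coarse, but the ordering is clear: replacing the target trajectory with $f (t) = 0$ produces the largest drop in landing success, followed by removing gimbal sweeping.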
