1. Introduction
Recent advances in unmanned aerial vehicle (UAV) technology have significantly expanded the application scope of drones. However, limited battery capacity remains a major constraint for long-duration mission operations. To address this limitation, autonomous perching technology, which reduces energy consumption by staying on structures, has gained attention. Especially in scenarios requiring extended flight time, such as power line inspection and long-term wildfire monitoring, perching is a key technology that can greatly improve operational efficiency [1, 2].
Unlike conventional ground landing, perching requires the UAV to detect thin, spatially limited structures such as poles and bars from a distance, approach them precisely, and settle on them stably. Outdoor environments make this problem substantially harder. They do not provide artificial markers, and they combine illumination changes, motion blur, occlusion, background clutter such as pipes and branches, and wind disturbances. These factors increase uncertainty in both perception and control, making practical outdoor perching difficult [3, 4]. As a result, an outdoor perching system must simultaneously provide (i) robust structure detection in uncooperative environments, (ii) stable guidance and descent under perception uncertainty, and (iii) real-time operation under limited onboard computational resources.
This study proposes an autonomous perching system that allows a drone to land on the horizontal bar of a hurdle structure. The system integrates the full perching pipeline within a single framework. It detects and tracks the target using onboard vision and avoids separate guidance fixtures or mechanical locking devices beyond a single-axis gimbal and a distance sensor. The final objective is to place the hurdle bar between the four landing gears so that stable perching can be achieved through attitude control alone, without physical coupling.
This objective leads to two tightly coupled challenges. First, the actual perching target is the horizontal bar, but at long range, the bar is difficult to distinguish because of its simple geometry and its visual similarity to background structures such as pipes or branches. To improve robustness across distances, we detect both the full hurdle and the horizontal bar and use both outputs in control. This perception strategy is supported by hard-negative sampling, P2 feature utilization, and selective module lightweighting so that the detector can improve long-range small-object recognition, preserve close-range precision, and still satisfy real-time onboard constraints. Second, the recognition outputs must be transferred to the controller without delay or instability. We therefore integrate a hurdle-specific detection model with a PID-based control loop, where the model is trained on approach video frames captured by the onboard camera, and the controller uses the estimated hurdle position and orientation to execute the perching maneuver.
Based on these problem settings, the proposed approach adopts several design decisions that reflect both perching requirements and platform constraints. We build on the YOLO family because it offers a strong speed–accuracy trade-off and a mature deployment ecosystem, and we select YOLOv12n as a lightweight baseline that can be adapted for this task through a P2 path and selective module simplification. Rather than detecting only the crossbar, the perception module jointly recognizes the hurdle frame as the whole object and the crossbar as its part. This whole-part formulation reduces scale-dependent ambiguity and helps suppress false positives from visually similar bar-like structures, enabling a more reliable transition from long-range approach to near-range alignment.
On the control side, we implement a modular state-machine architecture that separates the theoretically grounded reaching phases from the empirically tuned perching phases. This structure simplifies implementation and tuning while preserving explicit failure-handling logic for outdoor operation, where available sensor information changes with distance and flight condition. The hardware configuration is also deliberately minimal: the system relies on standard UAV components, including stereo cameras, a gimbal system, and a 1D LiDAR, and adds only a single ToF sensor. As a result, the platform provides the sensing functions required for perching while keeping hardware dependency low and maintaining portability to existing drone systems.
The contributions of this paper are as follows:
General-purpose perching system with low hardware dependency: We implement perching on a standard F450 frame without adding special mechanical structures. The system uses only one extra ToF sensor in addition to a gimbal camera and LiDAR for altitude estimation.
Lightweight model for small-object recognition: Based on YOLOv12n [5], we design a model for small-object detection with a computational cost of about 10 GFLOPs. The model adds a P2 path and applies selective module lightweighting.
Real-environment dataset construction: We construct a hurdle dataset that includes a range of distances, viewing angles, lighting conditions, and hurdle-like distractors, and we use it for training.
Integrated perception and control on embedded hardware: We implement the full autonomous perching system so that detection and control operate in real time on a Jetson Orin NX board.
Real-flight verification and ablation study: We evaluate perching performance through outdoor flight experiments and use ablation studies to identify the contribution of each major control component.
The remainder of this paper is organized as follows. Section 2 reviews related work and clarifies the gaps addressed in this study. Section 3 then presents the hurdle recognition model, and Section 4 describes the autonomous perching algorithm that uses the recognition outputs. Section 5 reports the experimental results, including both quantitative performance and representative failure cases. Finally, Section 6 summarizes the main findings and discusses limitations.
3. Hurdle Recognition Model
The hurdle recognition model proposed in this study is based on YOLOv12n, a lightweight detector with a computational cost of approximately 10 GFLOPs [5]. We selected YOLOv12n because the YOLO family provides a strong speed–accuracy trade-off and is widely used in aerial imaging applications. Among recent YOLO variants, YOLOv12n offers a particularly favorable performance-to-size ratio on standard benchmarks such as COCO, which makes it suitable for resource-constrained onboard deployment.
At first glance, it may seem sufficient to detect only the hurdle’s horizontal bar, since that bar is the final perching target. In outdoor environments, however, many structures, such as pipes and tree branches, resemble horizontal bars. The bar itself also occupies only a small image region, especially at long range. For this reason, we train the model to detect both the whole hurdle and the horizontal bar. We then design the control pipeline to use both outputs. The standard YOLOv12n architecture is also limited in its ability to preserve fine features for distant hurdles. To address these issues, we modify the model architecture and construct a custom hurdle dataset. The final model is converted to a TensorRT engine and optimized for real-time inference above 10 FPS on the Jetson Orin NX edge board.
Figure 1 shows the detailed structure of the proposed hurdle recognition model. The modified parts compared to YOLOv12n are highlighted in red. The most significant structural modification is the introduction of the P2 feature level to improve small object detection performance. The existing YOLOv12n FPN-PAN structure uses P3–P5 level feature maps, but this lacks sufficient resolution to capture fine structures such as thin horizontal bars or vertical poles at long distances. To compensate for this, we add a branch that upsamples the P3 feature map in the Neck stage, combines it with the P2 features from the Backbone, and then downsamples it again to pass to upper layers. The P2 feature map maintains high resolution and makes a critical contribution to recognizing small objects at long distances.
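To make the resolution argument concrete, the following minimal sketch computes the feature-map grid at each pyramid level. It assumes the common convention that level P_k has stride 2^k, and it uses the 640 × 480 image size that the inference code receives; the network may resize or pad this input internally, which the sketch ignores. The 6-pixel bar thickness is a hypothetical value chosen only for illustration.

```python
import math


def feature_map_sizes(width, height, levels=(2, 3, 4, 5)):
    """Return {level_name: (stride, grid_w, grid_h)}, assuming stride = 2**level."""
    sizes = {}
    for level in levels:
        stride = 2 ** level
        sizes[f"P{level}"] = (
            stride,
            math.ceil(width / stride),
            math.ceil(height / stride),
        )
    return sizes


def cells_covered(extent_px, stride):
    """Number of feature-map cells spanned by an extent given in input pixels."""
    return extent_px / stride


# Hypothetical example: a distant bar that is only 6 pixels thick in the image.
BAR_THICKNESS_PX = 6
sizes = feature_map_sizes(640, 480)
for name, (stride, grid_w, grid_h) in sizes.items():
    cells = cells_covered(BAR_THICKNESS_PX, stride)
    print(f"{name}: stride {stride}, grid {grid_w}x{grid_h}, bar spans {cells:.2f} cells")
```

Under these assumptions, the bar spans more than one cell only at the P2 level, which is the intuition behind adding the high-resolution path.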
However, the addition of the P2 feature level increases computational load and memory usage. To offset this and secure real-time performance on edge devices, this study replaces some A2C2f modules that do not significantly affect performance with lighter C3k2 modules, and selectively applies GhostConv modules [33, 34] that can reduce the number of parameters while maintaining feature extraction capability. These module replacements and structural changes are determined through iterative training speed experiments using the COCO dataset.
The hurdle dataset constructed for training and validation covers four types of hurdles with identical specifications but different colors, and consists of video frames captured under various weather, lighting, distance, and angle conditions. The dataset has the following main characteristics. First, the data reflect the constraints of actual flight environments. By including numerous motion-blurred and long-distance, low-resolution images that can occur during actual flight, we enable the model to operate robustly even with unclear inputs. Second, we employ negative samples to suppress false positives. The training data deliberately include two types of negative samples: (i) background images in which no hurdles exist, and (ii) images containing confusing structures with shapes similar to hurdles (horizontal bars, gazebos, pipes, tree branches, drone landing gear, disassembled hurdle parts, etc.). In particular, 505 images in the training dataset consist solely of such negative samples, which encourages the model to clearly distinguish between background noise and actual hurdles. In total, the hurdle dataset consists of 1964 training images extracted from 14 videos filmed using 3 types of hurdles and 99 validation images extracted from 1 video filmed using 1 type of hurdle. The test dataset uses 237 images extracted from 4 videos containing 4 types of hurdles, including 1 type of hurdle that is not used in training.
Figure 2 shows the physical specifications of the hurdle used in the experiments. The hurdle is manufactured using 50 A specification PVC pipes, consisting of a 1 m long horizontal bar, a 0.7 m long vertical pole, and four 0.5 m pipes to support the structure from the ground. Four types of hurdles painted in different colors are utilized for constructing the training and evaluation datasets.
The model’s inference results are converted by a post-processing algorithm into the state information required for drone control (center position, inclination, and distance). The post-processing algorithm is presented in Algorithm 1. The inference code receives images with a 640 × 480 resolution and generates segmentation masks for hurdles and horizontal bars. At long distances, where the horizontal bar is not identified, the geometric center of the entire mask recognized as a hurdle is set as the target point to guide the approach. In this regime, the model sometimes misrecognizes background noise as a hurdle while failing to recognize the actual hurdle. To address this, when the area recognized as a hurdle in the current frame is significantly smaller than in previous frames, the detection is treated as a misrecognition and filtered out. Conversely, once the drone enters close range and a horizontal bar is detected, the minimum-area rotated bounding box enclosing the largest connected region of the horizontal bar mask is generated, and the center coordinates and inclination of this box are used as control inputs. The physical distance to the target point is taken as the 25th percentile of the depth values of the pixels within the hurdle or horizontal bar region.
Figure 3 shows an example of the output data finally provided by the inference module. For camera images, the inference module provides the center coordinates of the hurdle or horizontal bar in the image, the angle of the rotated bounding box enclosing the horizontal bar, the short side length and long side length, and the distance value.
The main heuristic choices in Algorithm 1 were selected to stabilize control-relevant quantities rather than to optimize pixel-level segmentation quality. The history-based area filter rejects frames whose detected hurdle area becomes much smaller than the recent history mean. This rule is used only in the red-only branch, which typically corresponds to the far-range regime, where the hurdle occupies a small image region and isolated false positives often appear as tiny, short-lived masks. The noise factor was chosen as a conservative threshold that removes these transient detections without suppressing normal area fluctuations caused by moderate viewpoint change. The largest-component rule in the near-range branch is used because the controller needs only one bar toward which to guide the drone, and considering all bar candidates can misdirect it when some are false positives. For distance estimation, we use the 25th percentile of masked depth values instead of the minimum or the mean. The minimum was too sensitive to sparse outliers, while the mean was biased by background pixels near mask boundaries. The lower percentile provided a more stable estimate of the front-side structure depth, which is the quantity most relevant for approach control.
| Algorithm 1 Segmentation post-processing and control-state extraction |
| Require: Segmentation image (BGR), depth map (optional, 16UC1) |
| Ensure: |
| Constants: history length , noise factor , depth percentile , max samples 20,000 |
| 1: /* persistent state */ (initialized to 0), history queue (keep last N) |
| 2: Split S into channels |
| 3: , | ▹ ∨: per-pixel OR on 8-bit masks |
| 4: CountNonZero, CountNonZero |
| 5: Append to ; if , pop oldest |
| 6: if then return |
| 7: if then | ▹ red-only branch (far range) |
| 8: Mean |
| 9: if then return | ▹ noise rejection |
| 10: Moments |
| 11: if then return |
| 12: , , , |
| 13: |
| 14: else | ▹ green-present branch (near range) |
| 15: ConnectedComponentsWithStats |
| 16: if then return |
| 17: | ▹ largest component mask, 8-bit |
| 18: FindNonZero(L) | ▹ list of pixel coordinates |
| 19: if then return |
| 20: MinAreaRect(P) |
| 21: , , |
| 22: for | ▹ 4 vertices |
| 23: /* slope computation exactly as implemented */ |
| 24: if then | ▹ short side equals height |
| 25: |
| 26: else |
| 27: |
| 28: end if |
| 29: |
| 30: |
| 31: end if |
| 32: /* distance rule: masked depth percentile */ |
| 33: if D not available then |
| 34: |
| 35: else |
| 36: |
| 37: if then |
| 38: |
| 39: else |
| 40: if then |
| 41: uniform subsample of V to size |
| 42: end if |
| 43: | ▹ value at rank k (no full sort) |
| 44: end if |
| 45: end if |
| 46: return |
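The two heuristics discussed for Algorithm 1, the history-based area filter and the percentile-based distance rule, can be illustrated with a minimal standard-library sketch. The default history length and noise factor below are hypothetical placeholders, not the values used in the paper; the 20,000-sample cap and the 25th percentile follow the text. Treating a depth of 0 as invalid is also an assumption, and the sketch uses a full sort for simplicity, whereas Algorithm 1 notes a rank selection without a full sort.

```python
from collections import deque


class AreaHistoryFilter:
    """Reject frames whose mask area is far below the recent mean area."""

    def __init__(self, history_len=10, noise_factor=0.3):
        # history_len and noise_factor are hypothetical illustration values.
        self.history = deque(maxlen=history_len)
        self.noise_factor = noise_factor

    def accept(self, area):
        # The area is appended first, mirroring the step order in Algorithm 1,
        # so the mean includes the current frame.
        self.history.append(area)
        mean_area = sum(self.history) / len(self.history)
        return area >= self.noise_factor * mean_area


def depth_percentile(depths, percentile=25.0, max_samples=20000):
    """Return the given percentile of the valid depths, or None if there are none."""
    valid = [d for d in depths if d > 0]  # assumption: 0 marks an invalid depth
    if not valid:
        return None
    if len(valid) > max_samples:
        # Uniform subsample to bound the cost on large masks.
        step = len(valid) / max_samples
        valid = [valid[int(i * step)] for i in range(max_samples)]
    valid.sort()
    rank = int(round((percentile / 100.0) * (len(valid) - 1)))
    return valid[rank]


area_filter = AreaHistoryFilter(history_len=5, noise_factor=0.3)
decisions = [area_filter.accept(a) for a in [1000, 1100, 900, 50, 1000]]
distance = depth_percentile([0, 400, 100, 300, 200, 500])
```

In this toy sequence, only the isolated 50-pixel frame is rejected, while ordinary area fluctuations pass through.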
4. Autonomous Perching Algorithm
The proposed autonomous perching system integrates vision-based state estimation and control on an F450-frame quadcopter. For outdoor flights, GPS is excluded, and position is estimated only through stereo-camera-based MSCKF-VIO, assuming a GNSS-denied environment.
Figure 4 shows the complete hardware architecture and the interconnection of all components.
Figure 5 shows the resulting F450-based platform used in the experiments.
An NVIDIA Jetson Orin NX serves as the mission computer for perception and high-level control, while a Pixhawk 6C handles low-level attitude control. Two RealSense stereo cameras provide visual information. One is fixed to the front of the vehicle for VIO. The other is mounted on a single-axis gimbal driven by an XL330 servo for hurdle tracking. A 1D LiDAR (Lidar Lite V3) and a ToF distance sensor (VL53L0X) are used for altitude measurement and short-range distance checking during perching, respectively. This sensor configuration was chosen to maximize portability and ease of integration. Stereo cameras, a gimbal, and a 1D LiDAR are common UAV components, so the platform is easy to reproduce. By adding only a ToF sensor to this standard setup, we keep hardware dependency low while preserving the sensing functions required for markerless perching. The ToF sensor is reserved for the final stage of approach. Its effective range is limited to about 30 cm because the hurdle bar is thin, rounded, and weakly reflective.
One important design choice is the placement of the 1D LiDAR at the rear of the drone rather than directly beneath its center. This choice reduces the risk of false altitude measurements during hurdle approach. If the LiDAR were mounted in a conventional downward-facing position under the vehicle center, its beam could intersect the hurdle during descent. The controller would then interpret the shortened range as rapid altitude loss. That false reading would trigger an unnecessary ascent command, as illustrated in Figure 6, and could produce unstable oscillatory motion. By mounting the LiDAR at the rear and adopting a forward-then-backward approach trajectory, we keep the LiDAR beam clear of the hurdle for most of the maneuver and maintain a more reliable altitude reference.
In this study, considering limited hardware resources and uncertain outdoor environments, the design requirements for the autonomous landing algorithm are specified as follows:
Robustness to High Position Uncertainty: The system relies entirely on VIO, which can accumulate error, and uses no absolute position sensors such as GPS or motion capture systems. Therefore, the control algorithm must track robustly enough to tolerate the unavoidable drift and noise in position estimation [35].
Perching Phase Independent of Depth Information: During the descent phase, when the drone approaches the hurdle, depth information cannot be obtained because the target object lies within the stereo camera’s minimum measurement distance. Therefore, precise control during descent must be possible using only 2D image features and ToF sensor data.
Lightweight Design for Edge Computing Environment: The Jetson Orin NX mounted on the drone must simultaneously perform VIO, object recognition, gimbal control, and velocity control. Therefore, the entire algorithm must be lightweight enough to operate stably within limited computational resources.
Response to Irregular Control Periods: Due to the high computational demands of VIO and the hurdle recognition model, the control loop period may fluctuate severely. The algorithm must therefore land autonomously and stably despite irregular control periods.
To satisfy the above requirements, this study proposes the state-machine control structure shown in Figure 7. The overall process consists of five stages.
Prepare Reaching performs altitude adjustment and initialization.
Reaching tracks and approaches the hurdle from long range.
Perching descends precisely toward the horizontal bar from above.
Landing stabilizes posture after the drone makes contact with and becomes supported by the bar.
Searching scans the surroundings when recognition fails.
The state machine is adopted because it separates the theoretically grounded Reaching phases from the empirically tuned Perching phases. This separation simplifies both development and integration. It also allows each phase to be optimized independently while keeping the overall logic explicit. In particular, the Prepare Reaching stage is separated from Reaching so that altitude can be regulated by LiDAR before forward motion begins. If altitude were adjusted during forward approach, terrain-dependent LiDAR variations could cause unnecessary vertical corrections and make the vehicle altitude oscillate while it is moving toward the hurdle. Separating the stages, therefore, reduces the risk that VIO drift or terrain-induced vertical oscillation will destabilize the approach.
Across all stages, the gimbal command is updated continuously to track the hurdle. When the hurdle is not detected, the gimbal switches to sweeping motion. In Prepare Reaching, the drone mainly regulates altitude using the LiDAR measurement while keeping horizontal motion close to zero. In Reaching, it moves toward the hurdle with commanded x/y velocities derived from the hurdle position calculated from the camera measurements, while keeping vertical velocity at zero. In Perching, the y and yaw commands are chosen to center and align the hurdle in the image. The target gimbal angle on the desired trajectory is computed from the LiDAR-based altitude estimate. The x velocity is then determined from the difference between the target and current gimbal angles. The z velocity is commanded downward only when the drone is sufficiently close to the desired trajectory and satisfies the y/yaw alignment conditions; otherwise, the controller commands ascent or hover for recovery. In Searching, the drone suppresses aggressive translational motion and relies on gimbal sweeping to reacquire the target. The detailed control law for each stage is described below.
The drone body coordinate system is denoted as
, the gimbal coordinate system as
, the camera coordinate system as
, the virtual camera coordinate system obtained by translating the inertial coordinate system to be centered at the drone’s center as
, and the inertial coordinate system as
. Because the PX4 flight controller receives external command inputs through MAVLink, we use the MAVLink command convention in this study [36]. Due to the structure of the MAVLink interface, all drone velocity and angular velocity control commands are issued through the virtual camera coordinate system. For computational efficiency during close-range perching operations, where attitude variations are typically small and the system operates in hover-like conditions, we employ a simplified coordinate transformation that neglects higher-order terms in the rotation matrix expansion. This approximation is valid when the drone maintains relatively stable attitude angles and operates at low translational and rotational velocities, which are characteristic conditions during the precision perching phase. The relationship between the drone body and virtual camera coordinate systems can be expressed as follows:
where
is the drone’s current yaw direction.
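As a rough illustration of such a simplified transformation, the sketch below rotates a body-frame velocity into a yaw-aligned frame using the yaw angle only, which is what remains when roll and pitch are neglected. The function name, the axis ordering, and the sign convention (counterclockwise-positive yaw about the vertical axis) are assumptions made for this example and may differ from the convention used in the paper.

```python
import math


def body_to_virtual(v_body, yaw):
    """Rotate a body-frame velocity (vx, vy, vz) by yaw only; roll and pitch are neglected."""
    vx, vy, vz = v_body
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * vx - s * vy, s * vx + c * vy, vz)


# A pure forward body velocity with a 90-degree yaw maps onto the second axis.
v_virtual = body_to_virtual((1.0, 0.0, 0.0), math.pi / 2)
```

Because only one sine and one cosine are evaluated per command, the cost of this approximation is negligible on the onboard computer.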
In all flight stages, the gimbal is controlled to continuously track the horizontal bar of the hurdle, and when the hurdle is not detected, it performs a sweeping motion in a designated pattern to maintain the field of view.
Figure 8 shows the geometric relationship of the hurdle projected on the camera plane. When the center coordinates of the hurdle on the image plane are
, the vertical error
from the image center
and the corresponding pitch angle error
of the hurdle are as follows:
where
is the
y-axis focal length of the camera. The pitch angle
of the hurdle with respect to the body coordinate system
is estimated as follows using the current gimbal pitch angle
.
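The geometry described here follows the standard pinhole model, and a minimal sketch is given below. It assumes that the image row coordinate increases downward and that the gimbal pitch is measured positive downward, so the two angles simply add; the function and argument names are illustrative and not taken from the paper.

```python
import math


def hurdle_pitch(v, c_y, f_y, gimbal_pitch):
    """Return (vertical pixel error, pitch error, hurdle pitch in the body frame).

    v            : row coordinate of the hurdle center in the image (pixels)
    c_y          : row coordinate of the image center (pixels)
    f_y          : y-axis focal length (pixels)
    gimbal_pitch : current gimbal pitch angle (rad), assumed positive downward
    """
    e_v = v - c_y
    pitch_error = math.atan2(e_v, f_y)
    return e_v, pitch_error, gimbal_pitch + pitch_error


# Hypothetical numbers: the hurdle center lies 100 px below the image center.
e_v, pitch_error, hurdle_angle = hurdle_pitch(v=340.0, c_y=240.0, f_y=100.0, gimbal_pitch=0.1)
```

With a different sign convention for the image axis or the gimbal angle, only the signs in the last line of the function change.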
Gimbal control uses PID control with anti-windup in the discrete-time domain. When defining the control error as
, the calculation of each term and the final control output
are as follows:
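A discrete-time PID controller of this kind can be sketched as follows. The sketch uses integral clamping as the anti-windup mechanism and takes the measured time step as an argument so that irregular control periods are handled explicitly; the specific anti-windup scheme, the gains, and the limits used in the paper are not reproduced here, so all names and values are assumptions.

```python
class PID:
    """Discrete PID with integral clamping (one common anti-windup scheme)."""

    def __init__(self, kp, ki, kd, out_limit, int_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit  # symmetric saturation of the output
        self.int_limit = int_limit  # symmetric clamp on the integral state
        self.integral = 0.0
        self.prev_error = None
        self.last_output = 0.0

    @staticmethod
    def _clamp(value, limit):
        return max(-limit, min(limit, value))

    def update(self, error, dt):
        if dt <= 0.0:
            # Skip degenerate time steps instead of dividing by zero.
            return self.last_output
        self.integral = self._clamp(self.integral + error * dt, self.int_limit)
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.last_output = self._clamp(raw, self.out_limit)
        return self.last_output


# Hypothetical gains; note the deliberately irregular time steps.
pid = PID(kp=1.0, ki=0.5, kd=0.0, out_limit=2.0, int_limit=1.0)
outputs = [pid.update(1.0, dt) for dt in (0.5, 1.5, 0.2)]
```

Passing the measured `dt` on every call, rather than assuming a fixed period, keeps the integral and derivative terms consistent when the loop period fluctuates.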
The searching stage is a state where the structure has not been detected. The gimbal performs a sweeping motion at a constant speed to search for structures, and the drone’s target velocity is set to 0 in all directions. Gimbal angle updates use the following conditional incremental method.
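One plausible form of a conditional incremental sweep is sketched below: the gimbal angle advances by a fixed step per control cycle and reverses direction when it reaches either end of the sweep range. The step size and the angle limits are hypothetical, and the exact update rule in the paper may differ.

```python
def sweep_step(angle, direction, step, angle_min, angle_max):
    """Advance the sweep by one cycle and return (new_angle, new_direction).

    direction is +1 or -1; it flips when the sweep reaches a limit.
    """
    candidate = angle + direction * step
    if candidate >= angle_max:
        return angle_max, -1
    if candidate <= angle_min:
        return angle_min, 1
    return candidate, direction


# Hypothetical sweep between 0.0 rad and 1.0 rad in steps of 0.2 rad.
angle, direction = 0.9, 1
trace = []
for _ in range(3):
    angle, direction = sweep_step(angle, direction, 0.2, 0.0, 1.0)
    trace.append((round(angle, 6), direction))
```

The same function can serve the downward-looking sweep of the perching stage by passing a different pair of angle limits.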
In the Prepare Reaching stage, the drone adjusts its altitude using P control on the altitude error. If the drone moves sufficiently slowly, the LiDAR measurement can be used directly as the altitude without transformation. The target velocity
and rotational velocity
in the virtual camera coordinate system are calculated as follows:
where
is the altitude error calculated as the difference between the target altitude and the lidar-measured altitude. When the drone altitude
measured from the lidar satisfies
, it transitions to the reaching stage, and when it satisfies
while the hurdle is not detected, it transitions to the searching stage.
In the reaching stage, when the image coordinates
of the hurdle horizontal bar, camera parameters
, and distance
to the hurdle center obtained from stereo images are given, the position of the hurdle is as follows:
where
is the 3D position of the hurdle in the camera coordinate system. Applying rotation due to camera installation direction and rotation by the single-axis gimbal, we obtain the position of the hurdle in the body coordinate system
as follows:
with
where
is the 3D position of the hurdle in the body coordinate system. In the reaching stage, altitude velocity is maintained at 0, and the drone approaches the target at a constant speed. To achieve this controlled approach, the target velocities
and
are calculated as follows:
where
is the horizontal distance from the drone to the hurdle,
and
are the horizontal coordinates of the hurdle in the body coordinate system, and
is the maximum approach speed. P control is used in the reaching stage, and the control input
and
are calculated as follows:
where
,
, and
are the P control gains for each direction, and
is the maximum rotational speed limit.
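The reaching-stage computation can be sketched end to end as a pinhole back-projection, a rotation by the gimbal pitch, and a constant-speed approach command. The sketch assumes a camera frame with x right, y down, and z forward, a body frame with x forward, y right, and z down, a gimbal pitch measured positive downward, and a stereo distance interpreted as depth along the optical axis. These conventions, the function names, and the yaw-rate rule are assumptions for illustration rather than the paper's exact equations.

```python
import math


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel with known depth into the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_body(p_cam, gimbal_pitch):
    """Apply the mounting rotation and the single-axis gimbal pitch."""
    xc, yc, zc = p_cam
    c, s = math.cos(gimbal_pitch), math.sin(gimbal_pitch)
    return (zc * c - yc * s, xc, zc * s + yc * c)


def reaching_command(p_body, v_max, k_yaw, yaw_rate_max):
    """Return (vx, vy, vz, yaw_rate): constant-speed approach with zero vertical velocity."""
    xb, yb, _ = p_body
    r = math.hypot(xb, yb)  # horizontal distance to the hurdle
    if r < 1e-6:
        return (0.0, 0.0, 0.0, 0.0)
    yaw_rate = k_yaw * math.atan2(yb, xb)
    yaw_rate = max(-yaw_rate_max, min(yaw_rate_max, yaw_rate))
    return (v_max * xb / r, v_max * yb / r, 0.0, yaw_rate)


# Hypothetical intrinsics: hurdle at the image center, 4 m away, gimbal level.
p_cam = pixel_to_camera(320.0, 240.0, 4.0, 600.0, 600.0, 320.0, 240.0)
p_body = camera_to_body(p_cam, 0.0)
command = reaching_command((3.0, 4.0, 0.0), v_max=0.5, k_yaw=1.0, yaw_rate_max=0.3)
```

Normalizing by the horizontal distance keeps the approach speed constant regardless of range, and the yaw-rate clamp plays the role of the maximum rotational speed limit.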
In the reaching stage, the system transitions to the perching stage when the distance to the horizontal bar falls below a threshold and the gimbal angle tracking the bar exceeds a threshold, expressed as
where
is the threshold angle indicating that the gimbal is sufficiently pointing downward, and
is the threshold distance for target proximity determination. Also, if the hurdle recognition result has not been updated for
cycles, it transitions to the searching stage.
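The transition logic out of the reaching stage can be summarized in a few lines. In the sketch below, the threshold names and the numerical values in the usage lines are hypothetical; only the structure (a staleness check on the recognition result, then a joint angle and distance test) follows the description above.

```python
from enum import Enum


class Stage(Enum):
    REACHING = "reaching"
    PERCHING = "perching"
    SEARCHING = "searching"


def reaching_transition(gimbal_pitch, distance, stale_cycles,
                        pitch_threshold, distance_threshold, max_stale_cycles):
    """Decide the next stage while in the reaching stage."""
    if stale_cycles >= max_stale_cycles:
        # The recognition result has not been updated for too long.
        return Stage.SEARCHING
    if gimbal_pitch >= pitch_threshold and distance <= distance_threshold:
        # The gimbal points sufficiently downward and the bar is close.
        return Stage.PERCHING
    return Stage.REACHING


# Hypothetical thresholds: 0.9 rad, 1.0 m, and 10 stale cycles.
next_stage = reaching_transition(1.0, 0.8, 0, 0.9, 1.0, 10)
```

Checking staleness first ensures that an outdated detection cannot trigger the perching stage.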
Figure 9 shows the flowchart of the perching stage. In the perching stage, as the drone approaches the hurdle, distance calculation through stereo matching becomes inaccurate, and segmentation performance degrades due to occlusion. Therefore, in the perching stage, control is performed without using depth information.
First, the hurdle horizontal bar pitch angle
by the gimbal camera and the gimbal target angle change
for the next time point are calculated. If the hurdle recognition result was updated more recently than
control cycles, the gimbal is moved by the corresponding gimbal target angle change. Otherwise, the distance to the bar measured from the distance sensor is checked. If the distance is below
, the drone is directly above the bar and about to make contact, so landing is possible by descending and the camera does not search for the bar again. Otherwise, the bar must be reacquired, so the gimbal sweeps while the following velocity command slowly raises the drone’s altitude:
As a result, if the drone’s altitude becomes
or higher, it transitions to searching mode.
In the perching stage as well, if the structure is lost during descent, sweeping occurs as
In the perching stage, the hurdle is likely to be below the drone. Whereas the searching stage sweeps while looking forward, the perching stage therefore sweeps while looking at the ground.
If not yet transitioned to searching mode, the target velocity in perching mode is determined. The target velocity in
y/yaw direction is determined through the
x-coordinate and inclination of the hurdle horizontal bar in the image. To generate control input in a situation where the actual distance from the aircraft center to the horizontal bar is unknown, a projection point
in the camera coordinate system, assuming a virtual distance
, is defined. The target angular velocity
in the perching stage assumes a situation where the aircraft is close to above the hurdle horizontal bar and the gimbal camera projects nearly perpendicular toward the ground (
). Under this assumption, the projection
and the corresponding body frame coordinate
are as follows:
where
and
are the rotation matrices defined in the reaching stage. The target velocity in the perching stage
and the target angular velocity
are set as follows:
where
is the desired position error in the
y direction, and
is the inclination of the hurdle horizontal bar in the image. The velocity vector
is input to the PID controller.
The target velocity in the
directions is determined by several factors: the observed hurdle angle
, the target bar angle
at the current altitude, the LiDAR altitude
, the LiDAR+IMU corrected altitude
, the
y/yaw alignment state, and the status of the
perch_failed flag. The
perch_failed flag is asserted when a false landing is attempted. In that case, the drone slowly reverses while climbing to
. If the
perch_failed flag is asserted, the target velocity in the drone body frame
is used. The
perch_failed flag is released when
.
If the drone altitude
falls below
, the drone is in one of three states: it has landed on the bar, it has failed to land, or LiDAR occlusion has occurred because the beam hit the hurdle bar. If the distance sensor measurement satisfies
, the drone is judged to have landed on the bar, and the target velocity is
and the landing counter is increased. Otherwise, if
, LiDAR occlusion is assumed to have occurred. Because the LiDAR is placed at the rear bottom of the drone, the drone must reverse to escape this state. The target velocity
is set. If neither case applies, a false landing was attempted. In this case, the
perch_failed flag is asserted, and the target velocity at this time point is
to attempt ascent.
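The three-way decision at low altitude can be sketched as a small classifier followed by a command lookup. The contact threshold, the occlusion test (passed in here as a boolean), the speeds, and the convention that positive z means upward are all assumptions for illustration; the paper's actual conditions and velocity values are not reproduced.

```python
from enum import Enum


class LowAltitudeCase(Enum):
    LANDED = "landed"
    LIDAR_OCCLUDED = "lidar_occluded"
    FALSE_LANDING = "false_landing"


def classify_low_altitude(tof_range, tof_contact_max, occlusion_suspected):
    """Classify the situation once the LiDAR altitude is below the threshold."""
    if tof_range is not None and tof_range <= tof_contact_max:
        return LowAltitudeCase.LANDED
    if occlusion_suspected:
        return LowAltitudeCase.LIDAR_OCCLUDED
    return LowAltitudeCase.FALSE_LANDING


def low_altitude_command(case, reverse_speed, climb_speed):
    """Return (vx, vy, vz, perch_failed) in the body frame, with z assumed positive upward."""
    if case is LowAltitudeCase.LANDED:
        return (0.0, 0.0, 0.0, False)
    if case is LowAltitudeCase.LIDAR_OCCLUDED:
        # Reverse so the rear-mounted LiDAR beam clears the bar.
        return (-reverse_speed, 0.0, 0.0, False)
    # False landing: assert the flag and climb.
    return (0.0, 0.0, climb_speed, True)


case = classify_low_altitude(tof_range=0.05, tof_contact_max=0.10, occlusion_suspected=False)
command = low_altitude_command(case, reverse_speed=0.1, climb_speed=0.2)
```

Separating classification from command generation makes each branch easy to test against recorded sensor logs.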
Figure 10 shows the target trajectory used in the perching stage. In the figure, the drone approaches from the right, and the horizontal bar of the hurdle is located at
, where
h is the hurdle height and
w is the distance between the drone center and the camera. The perching target trajectory is defined as
. This trajectory is designed to keep the rear-mounted LiDAR beam from being occluded by the hurdle during descent and to reduce rear-looking views of the gimbal camera. This is important because, during the final approach, rear-looking views increase the likelihood that the drone body enters the image, and we empirically observed that the recognition model can falsely detect parts of the airframe, particularly the landing gear, as hurdle-like structures. Reducing such views, therefore, improves visual recognition reliability in the final approach.
For the above reasons, the trajectory of descending from the front of the hurdle, moving forward directly above the horizontal bar, and then reversing,
is adopted. The functional form was chosen to realize a smooth forward-then-backward motion with a single transition around the bar height; in particular, the shape of
was adopted because it naturally yields a continuous profile that first advances and then retreats as altitude changes. Within this functional form, the numerical coefficients were then empirically tuned through preliminary flight tests so as to reduce the total duration of the landing process while preserving successful perching behavior. At each
z,
so,
This is used as the target gimbal angle according to the drone’s altitude.
The target velocity in the
x direction is provided from the altitude measured by the lidar and the current angle of the gimbal. Specifically, this velocity component
is calculated using the difference between the target and current gimbal angles as follows:
where
is the altitude measurement of the lidar.
The target velocity in the
z direction has a constant value depending on whether the bar’s
y/yaw is aligned and whether the gimbal angle is in the target range. If the bar’s
y/yaw is not aligned or the gimbal angle significantly deviates from the target range, the
z direction target velocity is set so that the drone altitude becomes
. Otherwise, the drone stops or descends. The target velocity in the
z direction
is calculated as follows:
where
.
As the drone descends, if the distance sensor measurement and drone altitude satisfy the conditions, the landing counter is incremented. Otherwise, the landing counter is reset to 0. So,
Subsequently, transformation to the inertial coordinate system and PID control are performed. The final velocity command is,
When the perching counter perch_counter satisfies its threshold condition, the system transitions to the landing stage. The landing stage is the post-contact stage in which the drone remains seated on the horizontal bar. For generality, the landing-stage target velocity is written with an optional vertical term: a small downward bias was considered in the early design stage to press the vehicle slightly onto the bar. In the final implementation, however, we set this bias to zero, because once the vehicle is already supported by the bar, a persistent negative vertical-velocity command is not physically realizable and may adversely affect the internal PX4 velocity PID controller. Accordingly, in the implemented system, the landing stage applies zero translational velocity and relies on the PX4 attitude controller to maintain balance.
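As a minimal sketch of the descent logic described above, the following Python fragment shows one way the z-direction target velocity and the perch_counter update could be written. The class name DescentParams, the function names, every numeric value, and the positive-up sign convention are assumptions made for illustration only; the implemented values are those listed in the parameter table, and the case in which the drone stops rather than descends is omitted here.

```python
from dataclasses import dataclass


@dataclass
class DescentParams:
    # Every value below is a hypothetical placeholder, not a value from the paper.
    hold_altitude_m: float = 2.0     # altitude to return to when misaligned
    altitude_tol_m: float = 0.1      # dead band around the hold altitude
    recover_speed_mps: float = 0.3   # constant speed used to return to hold altitude
    descend_speed_mps: float = 0.2   # constant descent speed when aligned
    contact_range_m: float = 0.15    # distance-sensor reading treated as "bar is close"
    contact_altitude_m: float = 1.2  # altitude treated as "near bar height"
    counter_threshold: int = 10      # consecutive hits required to enter landing


def target_vz(aligned: bool, gimbal_in_range: bool, altitude_m: float,
              p: DescentParams) -> float:
    """Return a constant z-velocity target (assumed convention: positive is up)."""
    if not (aligned and gimbal_in_range):
        # Misaligned: move back toward the hold altitude at constant speed.
        error = p.hold_altitude_m - altitude_m
        if abs(error) <= p.altitude_tol_m:
            return 0.0
        return p.recover_speed_mps if error > 0 else -p.recover_speed_mps
    # Aligned and gimbal angle within range: descend at constant speed.
    return -p.descend_speed_mps


def update_perch_counter(counter: int, sensor_range_m: float, altitude_m: float,
                         p: DescentParams) -> int:
    """Increment while the contact conditions hold; reset to 0 otherwise."""
    satisfied = (sensor_range_m <= p.contact_range_m
                 and altitude_m <= p.contact_altitude_m)
    return counter + 1 if satisfied else 0


def should_enter_landing(counter: int, p: DescentParams) -> bool:
    """True once the counter has reached the (assumed) threshold."""
    return counter >= p.counter_threshold
```

Requiring several consecutive satisfied checks, rather than a single one, keeps an isolated noisy sensor reading from triggering the transition to the landing stage.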
5. Results
This section is organized into three parts. We first evaluate the hurdle recognition model, then analyze end-to-end autonomous perching performance, and finally examine representative failure cases that illustrate practical limitations and motivate the main design choices.
5.1. Accuracy of Hurdle Recognition Model
Before evaluating the integrated perching system, we analyze the detection accuracy of the enhanced YOLOv12n-based hurdle recognition model across several datasets and scenarios. This subsection establishes the perception capability on which the later control evaluation depends, and it has two main elements. We first verify that the proposed architectural changes improve general detection quality on standard aerial-imaging benchmarks. We then examine hurdle-specific recognition performance on our custom dataset. To quantify the effect of each proposed model component, we report three training settings. First, in the COCO-to-VisDrone setting, each comparison model is pre-trained on COCO for 100 epochs, and the resulting COCO mAP50(M) is reported after pre-training. Although this is not the final task performance, it is useful for comparing how structural changes affect general visual representation learning. Because VisDrone does not provide segmentation labels, the segmentation head is then replaced with YOLOv12’s detection head, and all other pre-trained weights are retained. These head-replaced, COCO-pre-trained models are trained on VisDrone for a further 200 epochs, after which mAP50(M) and mAP50:95(M) are calculated. Second, in the COCO-to-hurdle setting, the COCO-pre-trained models are trained on the hurdle dataset for a further 140 epochs, and performance is evaluated by calculating precision and recall separately for the hurdle (whole) and horizontal bar (part) classes. For these two trainings, we use a batch size of 16, the auto optimizer setting, and overlap_mask=false to prevent the hurdle and bar masks from being merged when they overlap spatially. All other hyperparameters follow the defaults of the yolo library.
Third, in the VisDrone-scratch setting, the models are trained on VisDrone from the beginning without COCO pre-training. In this third setting, the proposed model’s Segment head is replaced with YOLOv12’s Detection head and trained on the VisDrone dataset for 300 epochs. The input image size is , the batch size is 8, and the optimizer is SGD with learning rate 0.01 and momentum 0.937. The remaining hyperparameter settings follow the default parameters of the yolo library. Additionally, in model comparisons, not only accuracy metrics but also computational complexity, such as computational load (GFLOPs), are presented together, enabling simultaneous discussion of the performance-efficiency trade-off.
Hurdles photographed from long distances appear as very small objects in the image, and YOLO-series segmentation models tend to overestimate their masks by including surrounding background [37]. In our application, however, the main purpose of the recognition model is to estimate the hurdle center and inclination for control. Stable center estimation is therefore more important than pixel-accurate mask boundaries. For this reason, performance evaluation on the hurdle dataset adopts the matching-tolerance concept from BSDS500 [38], a boundary-detection benchmark, and calculates IoU after expanding both the ground-truth and predicted masks by a fixed amount. Specifically, when the image diagonal length is $d = \sqrt{H^2 + W^2}$, where $H$ and $W$ are the image height and width, the tolerance radius is defined as
In the input resolution (
) of this experiment,
, so
pixels. This
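tolerance radius is applied as a single dilation with an elliptical structuring element in the code implementation. For a binary mask $M \subseteq \Omega$, viewed as its set of foreground pixels (where $\Omega$ is the pixel grid), and a structuring element $B$, the dilation operation is denoted as
$$M \oplus B = \{\, p + b \mid p \in M,\ b \in B \,\} \cap \Omega.$$
Subsequently, the relaxed IoU of a ground-truth mask $G$ and a predicted mask $P$ is calculated as the IoU of the dilated masks:
$$\mathrm{IoU}_{\mathrm{relaxed}}(G, P) = \frac{|(G \oplus B) \cap (P \oplus B)|}{|(G \oplus B) \cup (P \oplus B)|}.$$
In this study, a prediction whose relaxed IoU satisfies the matching threshold constitutes a “detection success”.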
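The relaxed IoU described above can be illustrated with a short, dependency-free sketch. Masks are represented here as sets of (row, column) foreground pixels, and a disk of integer radius stands in for the elliptical structuring element; both choices, the function names, and the radius used in the usage note are assumptions made for illustration rather than the authors' implementation.

```python
def disk_offsets(radius):
    """Offsets (dy, dx) covered by a disk-shaped structuring element."""
    r = int(radius)
    return [(dy, dx)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= radius * radius]


def dilate(mask, offsets, height, width):
    """Dilate a set of (y, x) pixels, clipping to the image grid."""
    out = set()
    for y, x in mask:
        for dy, dx in offsets:
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out.add((ny, nx))
    return out


def relaxed_iou(gt, pred, radius, height, width):
    """IoU computed after dilating both masks by the same tolerance radius."""
    offsets = disk_offsets(radius)
    g = dilate(gt, offsets, height, width)
    p = dilate(pred, offsets, height, width)
    union = g | p
    return len(g & p) / len(union) if union else 0.0
```

For two single-pixel masks two columns apart on a 10 × 10 grid, the plain IoU is 0, whereas `relaxed_iou({(5, 5)}, {(5, 7)}, 1, 10, 10)` is 1/9, because the two dilated masks share one pixel out of nine. This is the sense in which the tolerance forgives small localization offsets on very small objects.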
Additionally, when multiple hurdles exist in an image or parts of hurdles are occluded, defining TP/FP/FN at the instance level using only pixel-wise IoU becomes difficult. Therefore, we extract connected components from the dilated masks using 8-connectivity and perform region-level matching. Each predicted region is paired with the not-yet-matched ground-truth region that gives the maximum IoU, and if that IoU is above the matching threshold, the pair is counted as a TP. Unmatched predictions are counted as FP, and unmatched ground truths as FN. Final precision and recall are calculated as $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$ and $\mathrm{Recall} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
In autonomous landing problems, temporary non-detection of hurdles can be resolved by the passage of time or movement of the drone, but incorrectly detecting structures that are not hurdles creates movement to the wrong locations. Consequently, in this study, precision is considered the highest priority, followed by recall. If the center coordinates and inclination of objects recognized by the model are accurate, the shape of the segment is relatively less important for drone control, so mAP50 is used only as a reference measure.
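The region-level matching procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: masks are sets of (row, column) pixels, the function names are hypothetical, predictions are processed in list order, and the comparison against the threshold uses `>=`, which is an assumption.

```python
def connected_components(mask):
    """Split a set of (y, x) pixels into 8-connected regions."""
    remaining = set(mask)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, stack = {seed}, [seed]
        while stack:
            y, x = stack.pop()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    neighbor = (y + dy, x + dx)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        region.add(neighbor)
                        stack.append(neighbor)
        regions.append(region)
    return regions


def iou(a, b):
    """Plain IoU of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_regions(gt_regions, pred_regions, threshold):
    """Greedy one-to-one matching; returns (TP, FP, FN)."""
    unmatched = list(range(len(gt_regions)))
    tp = fp = 0
    for pred in pred_regions:
        # Best not-yet-matched ground-truth region for this prediction.
        best = max(unmatched, key=lambda i: iou(gt_regions[i], pred), default=None)
        if best is not None and iou(gt_regions[best], pred) >= threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because each ground-truth region can be matched at most once, a second prediction on an already matched hurdle is counted as a false positive, which is consistent with the priority the evaluation gives to precision.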
Table 1 reports the COCO pre-training results together with the results of the first setting, namely, additional training on VisDrone after COCO pre-training. The listed computational costs correspond to the final detection and segmentation configurations used for VisDrone and hurdle evaluation, respectively. To isolate the effect of each architectural change, we also include an intermediate model in which only the P2 path is added to the YOLOv12 baseline. Adding the P2 path improves both COCO performance after pre-training and VisDrone performance, suggesting benefits for both general feature learning and small-object detection. The subsequent lightweighting step recovers 0.6/0.4 GFLOPs out of the 1.0/0.8 GFLOPs increase introduced by the P2 path, while still providing additional performance gains on VisDrone. These COCO-VisDrone benchmark-level results are important because they show that the proposed modifications are not only tuned to the custom hurdle dataset. Instead, they improve the model in a way that remains visible on broader aerial-imaging tasks. This makes the later hurdle-specific analysis more convincing.
Table 2 shows the performance of scratch training on VisDrone without COCO pre-training. Examples of recognition results of the three models are shown in Figure 11. The proposed model outperforms the baseline YOLOv12n and also shows better performance than YOLOv8n. The scratch-training results lead to the same conclusion as the pre-trained comparison: the proposed architecture retains an accuracy advantage even without relying on COCO initialization. This consistency suggests that the gain arises from the architecture itself rather than from COCO initialization alone.
We next examine COCO pre-trained hurdle-dataset results in order to compare the practical effects of negative-sample training, whole-part joint learning, and the proposed architectural modifications under the target task itself. This comparison is especially relevant because these choices primarily affect the precision–recall trade-off that matters during real flight.
Table 3 summarizes the effect of model structure and training strategy on hurdle recognition. Training with negative samples causes a slight decrease in recall and mAP50, but it improves precision substantially, from 0.744 to 0.819 for hurdles and from 0.934 to 0.969 for horizontal bars. This result indicates that negative samples effectively suppress confusion with hurdle-like background structures. This precision gain is operationally important. In autonomous perching, a false positive can direct the vehicle toward a dangerous or irrelevant structure. By contrast, a temporary missed detection can often be recovered by continued motion or target reacquisition.
Compared with the single-class setting that detects only the horizontal bar, the joint setting that detects both the hurdle and the bar maintains bar recall at 0.792 while slightly improving precision from 0.962 to 0.969 and mAP50 from 0.775 to 0.783. This suggests that multi-class learning can exploit the structural context of the whole hurdle without degrading bar-recognition performance, which is consistent with the whole–part recognition strategy adopted in the controller design.
The proposed structural modifications increase computational load by approximately 5.8%. Even so, they improve mAP50 at the COCO pre-training stage by about 1 percentage point and improve hurdle precision from 0.819 to 0.884, a gain of 6.5 percentage points. These results suggest that the high-resolution P2 path contributes meaningfully to recognizing small hurdles at long range. Horizontal-bar performance degrades slightly, but this trade-off is acceptable for our system. The controller uses the whole hurdle during the long-range approach and switches to the horizontal bar only at close range, and most recognition failures occur during long-range hurdle detection. For that reason, the gain in hurdle recognition is more important than the small loss in bar performance. Overall, these results support the proposed model as the perception front end for the full autonomous perching pipeline.
The aggregate results above suggest that the main benefit of the proposed architecture should appear in the long-range regime, where hurdles occupy only a small portion of the image. To test this point directly, we split the hurdle-dataset samples into cases in which the hurdle occupies more than 1/9 of the image area and cases in which it occupies less than 1/100, corresponding approximately to close-range and long-range operation, respectively. The table below shows that the structural modifications mainly improve performance in the small-hurdle regime.
Table 4 confirms that the advantage of the proposed model appears mainly in the small-hurdle regime, whereas all models perform similarly when the hurdle is already large in the image. This is the more important regime for our application because long-range recognition strongly affects whether the subsequent approach can proceed stably.
For small hurdles, the proposed model improves precision from 0.645 to 0.767 and recall from 0.710 to 0.812 relative to YOLOv12n, while mAP50(M) increases from 0.685 to 0.774. YOLOv8n also improves small-hurdle performance relative to YOLOv12n, but it requires 8.2 GFLOPs, compared with 6.5 for YOLOv12n and 6.9 for the proposed model. The proposed model, therefore, provides the best balance between long-range hurdle recognition and onboard computational cost, which supports the benefit of the added high-resolution P2 feature level.
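As a worked check, the small-hurdle gains relative to YOLOv12n follow directly from the values quoted above, expressed in absolute terms:

```latex
\begin{align*}
\Delta\,\mathrm{Precision} &= 0.767 - 0.645 = 0.122 \quad (12.2\ \text{percentage points}),\\
\Delta\,\mathrm{Recall}    &= 0.812 - 0.710 = 0.102 \quad (10.2\ \text{percentage points}),\\
\Delta\,\mathrm{mAP50(M)}  &= 0.774 - 0.685 = 0.089 \quad (8.9\ \text{percentage points}).
\end{align*}
```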
For small bars, the proposed structural modifications do not produce a clearly meaningful performance gain. This is not unexpected, because the simple shape of a small horizontal bar makes it easy to confuse with similar objects such as branches or drainpipes, and at long range, the loss of fine visual detail makes it blend more easily into the background. In practice, however, this limitation is acceptable for our system, because the controller mainly relies on whole-hurdle recognition during the long-range approach stage and switches to bar-based guidance only after the vehicle has moved sufficiently close to the target.
Figure 12 shows the hurdle recognition results of the proposed model. The inference results corresponding to the original images at the top are presented as mask images at the bottom, with the entire hurdle area visualized in red, the horizontal bar area in green, and the overlapping area of the two in yellow. The examples show that even when multiple hurdles exist within the field of view, each object is clearly separated and recognized. In the shown cases, structures visually similar to hurdles are not falsely detected, and hurdle components remain detectable even at long range or under partial occlusion. Taken together, these results show that the proposed hurdle-recognition model can reliably detect the whole hurdle at long range while suppressing false recognition of hurdle-like structures.
5.2. Accuracy of Autonomous Perching Algorithm
Having established the detection capability of the recognition model, we now evaluate the end-to-end performance of the complete autonomous perching system through real outdoor flight experiments. This subsection analyzes six outdoor flight experiments designed to evaluate both the overall performance of the proposed system and the contribution of its main components. All experiments use a scenario in which the drone takes off and attempts to land on a hurdle located approximately 10 m ahead. To avoid overfitting the evaluation to a single initial pose, the takeoff position is randomized within a radius of 1 m around a nominal point 10 m in front of the hurdle. This setup allows us to assess robustness under realistic initial-condition variation.
For performance evaluation, we distinguish between ‘perching’ and ‘landing’ in order to separate initial bar contact from subsequent stable post-contact behavior. This distinction is important because successful contact does not necessarily guarantee a successful landing; posture misalignment can still cause failure, and the sensors may also fail to detect that the drone is already seated on the hurdle, which can trigger a retry.
The autonomous perching algorithm uses the parameter set summarized in Table 5. The camera parameters $(f_x, f_y, c_x, c_y)$, where $f_x$ and $f_y$ denote the focal lengths and $c_x$ and $c_y$ denote the principal point coordinates, use the default RealSense calibration values. The proposed system is designed for a standardized F450 frame, and the control parameters were determined through real-flight experiments.
Although the parameters in Table 5 are empirical, they are not arbitrary. The transition thresholds are tied to sensing validity and vehicle geometry. For example, the image-size and gimbal-angle thresholds are selected so that the controller switches to perching only after the hurdle is sufficiently large in the image and the gimbal angle indicates that the vehicle is close enough for bar-centered alignment. The altitude-related thresholds are likewise set with consideration of the hurdle height, landing-gear clearance, and the rear-mounted LiDAR offset, which helps reduce the likelihood of incorrect decisions during the perching stage. The perching-stage gains were then empirically fixed through extensive parameter exploration in real-flight experiments so as to provide stable descent and alignment behavior during perching.
Performance is evaluated by dividing the flight process into three stages: ‘reaching’, ‘perching’, and ‘landing’. For each stage, we record success or failure and the corresponding time distribution. Here, the ‘perching’ stage means that the rectangular area formed by the four landing-gear endpoints is in physical contact with the hurdle bar, but stable landing is not yet guaranteed. The ‘landing’ stage is defined as the earliest time point at which, after the drone has descended onto the bar, continuous contact between the bottom surface of the drone and the bar begins to be maintained. Therefore, inadequate posture alignment during perching can still lead to landing failure and a subsequent re-approach.
Table 6 summarizes the experimental results. Stage durations are analyzed only for successful trials. A low perching success rate relative to the reaching success rate indicates tracking failure during descent. In most cases, this failure is caused by rapid changes in the hurdle’s gaze angular velocity, that is, the apparent angular velocity of the hurdle as seen from the camera. A low landing success rate relative to the perching success rate indicates failure after contact, which usually means that the drone reached the bar with an unsuitable posture. The full system, which combines the proposed hurdle recognition model with the complete control algorithm, achieves the highest landing success rate. By contrast, YOLOv12n without negative-sample training sometimes drives the drone away from the correct path and increases total completion time. These results show that both model design and dataset design contribute directly to practical perching performance. For the full configuration, the landing success rate is 6/6 (exact 95% binomial CI: 54.1–100%). The most important outcome of Table 6 is that the proposed hurdle recognition model is best not only in perception metrics but also in end-to-end flight success. This is critical because the purpose of the model is not standalone recognition but support for a stable and safe control loop during the full maneuver.
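The exact binomial interval quoted for the 6/6 result can be reproduced with a short Clopper-Pearson calculation. The sketch below uses only the Python standard library and solves for the bounds by bisection; the function names are illustrative and are not taken from the authors' analysis code.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2, defined as 0 when k == 0.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, defined as 1 when k == n.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(6, 6)
print(f"{100 * lower:.1f}% to {100 * upper:.0f}%")
```

When all trials succeed, the lower bound has the closed form $(\alpha/2)^{1/n}$, which for $n = 6$ and $\alpha = 0.05$ gives $0.025^{1/6} \approx 0.541$. The width of this interval is the reason six trials per condition cannot support strong statistical claims.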
We also performed an ablation study to isolate the contribution of each major control component. First, even at low close-range speed, the gimbal can lose the target because the hurdle’s gaze angular velocity changes rapidly during descent. When gimbal sweeping is removed, the system cannot reacquire the hurdle quickly, which reduces landing success and increases completion time. Second, removing the y-axis and yaw-alignment process accelerates descent, but it also increases landing failures and retries. In that setting, the drone reaches the bar with a less stable posture, which appears as larger time variance and lower success rates. Third, when the perching trajectory is replaced with a simplified one, rear-looking views occur more frequently. In these views, the camera can observe drone structures such as the landing gear. That increases misrecognition and can drive the drone toward incorrect locations during descent. As a result, the landing success rate drops sharply. These ablation results clarify that reliable perching requires recognition robustness, active target reacquisition, and the trajectory and alignment logic to work together. Accordingly, the proposed system is evaluated as an integrated perception–control framework rather than as a detector and controller considered separately.
Figure 13 presents a chronophotographic view of the first full-configuration experiment. It shows the drone’s position and attitude changes throughout the maneuver. During the reaching stage, the drone moves quickly toward the hurdle. During the perching stage, it follows the proposed forward-then-backward descent trajectory. A small overshoot caused by inertia is visible immediately after the drone reaches the area above the hurdle. This qualitative result is consistent with the quantitative timing and success-rate data above.
Figure 14a shows an enlarged chronophotographic view of the drone near the hurdle during one of the full-configuration experiments.
Figure 14b shows the 10 s interval immediately after the drone establishes stable contact with the horizontal bar. The visualization overlays eight frames extracted from a 10 s video. It shows that the proposed system can maintain posture on the narrow bar through attitude control alone, without a separate physical fixing device such as a gripper.
Figure 15 shows the execution period of each algorithm measured during 10 s of autonomous perching operation. The recognition module, including hurdle detection and pose estimation, operates with an average period of 81.54 ms, while the autonomous perching control algorithm operates with an average period of 33.32 ms. These results confirm that the full perception-control loop runs fast enough for onboard deployment in the tested setting.
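To put the measured periods in more familiar terms, the short calculation below converts them into update rates. The function and variable names are illustrative; only the two period values come from the measurements above.

```python
def period_ms_to_rate_hz(period_ms):
    """Convert an average loop period in milliseconds to a rate in hertz."""
    return 1000.0 / period_ms


RECOGNITION_PERIOD_MS = 81.54  # hurdle detection and pose estimation
CONTROL_PERIOD_MS = 33.32      # autonomous perching control algorithm

recognition_hz = period_ms_to_rate_hz(RECOGNITION_PERIOD_MS)
control_hz = period_ms_to_rate_hz(CONTROL_PERIOD_MS)
# Average number of control cycles executed per recognition update.
control_cycles_per_detection = RECOGNITION_PERIOD_MS / CONTROL_PERIOD_MS

print(f"recognition: {recognition_hz:.1f} Hz")
print(f"control:     {control_hz:.1f} Hz")
print(f"control cycles per detection: {control_cycles_per_detection:.2f}")
```

These figures correspond to roughly 12.3 Hz for recognition and 30.0 Hz for control, so on average the control loop runs about 2.4 times per recognition update. This suggests that the controller operates on the most recent available detection between updates, although that detail is an inference from the timing rather than a statement in the text.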
5.3. Failure Cases
Although the proposed system achieved high success rates in the main experiments, quantitative success rates alone do not fully explain the role of each design choice. For that reason, this subsection presents two representative failure scenarios observed during the ablation study. These cases complement the numerical results above and clarify why false-positive suppression and gimbal sweeping are essential in practice.
During ablation experiments with the standard YOLOv12 model and without negative-sample training, the recognition system occasionally misidentified background structures, especially trees and poles, as hurdle targets.
Figure 16 shows one such case. After losing sight of the real hurdle, the drone abruptly changed direction toward a background tree. This behavior highlights the importance of suppressing false positives through negative-sample training. For safety, the drone was forced to land manually just before impact. This failure mode also explains why precision is prioritized over recall in our evaluation. A temporary missed detection can often be recovered through continued flight or gimbal sweeping. A confident false detection, in contrast, can drive the drone toward a dangerous target.
The ablation study without gimbal sweeping revealed a clear limitation in target reacquisition.
Figure 17 shows a drone wandering above the hurdle after losing visual contact. Near the hurdle, the target can move rapidly in the camera image because its apparent angular velocity becomes large. Without gimbal sweeping, the system cannot quickly recover a target that leaves the field of view. This problem is especially severe when the drone is already close to the hurdle, because small vehicle motions then produce large image-space shifts. In this failure mode, the drone remained above the hurdle area without completing the perching sequence. The gimbal sweeping mechanism is therefore essential for recovering from temporary tracking loss and continuing the maneuver.
Taken together with the earlier accuracy, flight-test, and timing results, these failure cases complete the system-level argument of the Results section. The proposed framework is not only accurate and effective but also executable in real time on the onboard computing platform used in the experiments, while the ablations clarify the practical consequences of removing key design elements.
6. Conclusions
This paper presented an onboard perching system that enables a quadrotor UAV to perch autonomously on the horizontal bar of a markerless hurdle in an uncooperative outdoor environment. The proposed system integrates perception, post-processing, and control within a single framework that runs in real time on resource-constrained onboard hardware.
The experimental results support the effectiveness of this integrated design. In the small-hurdle regime, the proposed perception model improves hurdle detection precision and recall by 12.2 and 10.2 percentage points, respectively, relative to the YOLOv12n baseline. In outdoor flight tests, the full system achieves 6/6 landing success. The perception module runs with an average period of 81.54 ms on a Jetson Orin NX. The ablation studies further confirm the roles of negative samples, the P2 path, gimbal sweeping, the perching trajectory, and alignment control. Taken together, these results show that a lightweight integrated perception–control framework can enable practical markerless hurdle perching on onboard hardware and remain feasible in real outdoor flight on thin 3D structures under realistic sensing constraints.
Despite these results, several limitations remain. First, many control parameters still depend on empirical tuning for the F450-based platform. This may limit reproducibility across different platforms, payloads, sensor configurations, and vehicle dimensions. Because the proposed controller sends high-level velocity commands through PX4, moderate changes in total vehicle mass and weak variations in overall vehicle dimensions are expected to have limited influence on the upper-level control logic when the underlying PX4 velocity controller is well tuned. However, geometric quantities that enter the control law explicitly, such as the distance between the gimbal camera and the drone center, may be more sensitive to platform changes. Wind and temperature effects may also be partly absorbed by the lower-level PX4 controller, but the quantitative sensitivity of the full system to these factors has not yet been analyzed. Second, the current experiments do not yet cover a sufficiently broad range of illumination changes, wind disturbances, cluttered backgrounds, obstacle conditions, and dynamic targets for a full robustness and generalization assessment. In particular, performance on moving targets, including slowly moving ones, has not been directly validated. Third, both the hurdle dataset and the experimental scale remain limited relative to what would be needed for stronger statistical and domain-generalization claims. As a result, the manuscript does not yet provide sufficient evidence of robustness and generalization to support stronger deployment-oriented claims. In particular, six flight tests per condition are not sufficient for definitive statistical validation of robustness and generalization.
Future work should therefore focus on three directions. First, manual parameter dependence should be reduced through systematic sensitivity analysis and by quantifying how performance changes with vehicle weight, payload, and geometric variations. Second, flight validation should be expanded with more trials per condition and with more diverse environments, including varying illumination, wind disturbances, complex visual backgrounds, obstacle conditions, and slowly moving targets. Third, the dataset should be enlarged to include more diverse structures and appearances so that model generalization can be assessed more rigorously. Extensions to more complex 3D structures, non-rigid targets, and multi-drone perching scenarios are also promising directions.
Overall, the proposed system shows that markerless outdoor perching on a thin hurdle bar is feasible with a lightweight onboard perception–control pipeline. The methods and experimental findings provide a practical basis for future UAV perching systems intended for long-endurance missions such as infrastructure inspection, surveillance, and autonomous charging or resting operations.