N-Cameras-Enabled Joint Pose Estimation for Auto-Landing Fixed-Wing UAVs

Abstract: We propose a novel 6D pose estimation approach tailored for auto-landing fixed-wing unmanned aerial vehicles (UAVs). This method facilitates the simultaneous tracking of both position and attitude using a ground-based vision system, regardless of the number of cameras (N-cameras), even in Global Navigation Satellite System-denied environments. Our approach consists of a pipeline in which a Convolutional Neural Network (CNN)-based detection of UAV anchors drives the estimation of the UAV pose. To ensure robust and precise anchor detection, we designed a Block-CNN architecture that mitigates the influence of outliers. Leveraging the information from these anchors, we established an Extended Kalman Filter to continuously update the UAV's position and attitude. To support our research, we set up both monocular and stereo outdoor ground view systems for data collection and experimentation. Additionally, to expand our training dataset without requiring extra outdoor experiments, we created a parallel system that combines outdoor and simulated setups with identical configurations. We conducted a series of simulated and outdoor experiments. The results show that, compared with the baselines, our method improves anchor detection precision by 3.0% and the accuracy of position and attitude estimation by 19.5% and 12.7%, respectively. Furthermore, these experiments affirm the practicality of our proposed architecture and algorithm, which meet the stringent requirements for accuracy and real-time capability in the context of auto-landing fixed-wing UAVs.


Introduction
Unmanned aerial vehicle (UAV)-based systems have recently gained prominence as highly efficient solutions across various application domains. Among the pivotal functionalities required for UAV operations, autonomous landing stands out as a critical capability. Nonetheless, achieving autonomous landing still presents formidable technical challenges. In addition to the intricate control aspect, which has been widely acknowledged as a challenging problem [1,2], accurate state estimation of the aircraft is another demanding task. For fixed-wing aircraft, compared to rotary-wing aircraft, landing is additionally complicated by the non-zero airspeed at the moment of touchdown. As a result, control actions must be generated and executed in a very short time. The basis for forming these control actions is the pose (position and attitude) of the aircraft.
Drones 2023, 7, 693
There are two alternatives for the aircraft's pose estimation. In the first, onboard sensors provide the required information, in particular the Global Navigation Satellite System (GNSS) combined with inertial navigation systems (INS). The second alternative relies on ground equipment to estimate the aircraft's pose, which is then transmitted to the aircraft. The onboard approach often relies on stable satellite signals [3,4]. In addition, conventional INS can be sensitive to magnetic fields in the environment and are influenced by temperature variations. Traditionally, addressing this issue necessitates the integration of supplementary sensors, such as visual navigation systems, to enhance the aircraft's pose estimation during the landing process [5]. The offboard approach allows for a significant reduction in the complexity and cost of the aircraft.
This paper adopts the second option and proposes a pose estimation method for aircraft auto-landing based on a ground vision system, regardless of the number of cameras. A parallel ground vision system that combines outdoor and simulated setups with identical configurations is established to evaluate the method's performance. The results indicate a significant improvement in pose estimation accuracy compared to state-of-the-art methods. In addition, the results show that the proposed architecture and method satisfy the stringent requirements for accuracy and real-time capability in the context of auto-landing fixed-wing aircraft. Compared with our previous work [6], which focuses only on the position estimation of the UAV, this paper aims to realize a joint estimation of the position and attitude without dependence on the number of cameras. In summary, the contributions of this paper are as follows:
(1) Independence of the Number of Cameras: Our proposed pose estimation method is versatile and compatible with ground vision systems utilizing any number of cameras, whether it is a monocular vision system (N = 1) or a stereo vision system (N > 1). Generally, the inclusion of more cameras enhances the system's resilience to measurement errors, such as anchor detection inaccuracies and pan-tilt unit (PTU) attitude errors.
(2) Elimination of Excessive Outdoor Data Requirement: Traditional approaches often entail an extensive and labor-intensive process of outdoor fixed-wing landing experiments to collect essential data, incurring significant time and resource costs. Our method, however, achieves accurate and robust outdoor UAV anchor detection by utilizing just 730 frames of data from two outdoor landings. This approach provides a viable solution for scenarios characterized by high outdoor experimental costs.
(3) Robust and Accurate Pose Estimation: Autonomous landing necessitates quick and dynamic movement of the UAV in three-dimensional space, resulting in rapid changes in visual appearance, imaging backgrounds, and more. These spatial and temporal variations present formidable challenges to ground vision-based UAV pose estimation. Our method addresses these challenges, enabling more precise and robust anchor detection and pose estimation than state-of-the-art methods. This improvement has been validated by replacing the onboard GPS-INS positioning system with our method as the sole source of position and attitude data during the outdoor UAV auto-landing process.
(4) Simulated and Real Auto-landing Dataset: We have constructed a comprehensive dataset comprising eight simulated and four real landing videos, complete with labels such as target bounding boxes, anchors, and ground truth UAV pose information. This dataset, encompassing diverse conditions including varying wind directions and landing paths, serves as a resource for UAV detection and pose estimation research.
This paper is organized as follows: Section 2 provides a review of different options for aircraft pose estimation and analyzes their capabilities. In Section 3, we present the problem of UAV pose estimation based on ground vision. The module details, including CNN training, anchor detection, and 6D pose estimation, are then presented in Section 4. In Section 5, three simulated experiments and two outdoor experiments are presented to validate the feasibility, real-time capability, and robustness of the proposed pose estimation method. The paper is finally concluded in Section 6.

Related Works
The GNSS and INS-integrated navigation system is still the most common means for aircraft pose estimation during auto-landing. However, due to weather factors and multiwave effects in the landing area, the risk of landing accidents increases significantly when GNSS signals are occasionally interrupted or continuously disturbed. To improve the robustness of the auto-landing system, GNSS-independent auto-landing guidance systems have attracted researchers' attention. The early systems usually used radio, radar, or lidar to measure the distance between the aircraft and the landing area. Thales, a company in France, built a radio-based system to assist in UAV auto-landing on deck. This system was deployed on an H-6U "Bird" rotary-wing aircraft and a French Navy "Raphael" class frigate, completing a series of technical verifications such as long-range alignment and the mutual measurement of moving platforms. Yang et al. [7] provided a comprehensive review and analysis of radio frequency-based position estimation technology for aircraft. In 1999, the Swiss aerospace company RUAG developed the Object Position and Tracking System (OPATS). It uses laser measurement technology to measure the position and angle of drones and can support the guided landing of Swiss Air Force "patrol" drones. Kim et al. [8] used a lidar mounted on a ground vehicle and realized lidar-guided aircraft auto-landing on the ground vehicle. Radar-based guidance systems are common in both military and civilian fields. In 1996, Sierra Nevada Corporation (SNC) established a millimeter wave radar-based universal automatic recovery system for aircraft for the US military. Pavlenko et al. [9] proposed an aircraft localization method based on 24 GHz secondary radar sensors. By dropping several active beacons, the position is estimated via distance and angle information to the deployed beacons. Most of the above systems have been applied successfully, especially in military fields, and show great accuracy and robustness. However, most of them are sensitive to magnetic fields and smog in the environment. Furthermore, some onboard auxiliary devices are necessary, which means modifications to the aircraft system are required.
Visual navigation systems offer an attractive alternative capable of mitigating drift issues by fusing prior knowledge with real-time data. The integration of a vision system augments the amount of environmental information accessible and enhances the robustness of self-state estimation. Using an onboard or offboard camera, the vision navigation system provides the aircraft with accurate and real-time self-states through techniques such as visual odometry (VO) [10] and visual simultaneous localization and mapping (VSLAM) [11]. Therefore, visual navigation is emerging as a viable auto-landing solution due to its intrinsic ability to incorporate rich environmental information.
Onboard vision: For the autonomous landing of rotary-wing aircraft, the aircraft's pose is often estimated by detecting markers painted on a static platform with an onboard camera [12,13]. For more complex scenarios such as landing on a ship deck, Wang et al. [14] realized the autonomous landing of a Parrot AR Drone on a vessel deck platform relying only on onboard sensors. They simulated the movement of a ship deck with an attitude-programmable plate. Landing on a moving target [15][16][17] is more challenging compared with ship landing in terms of localization, trajectory planning, and control. For fixed-wing UAVs, however, it is challenging to track the ground marker throughout the entire landing process. This is because, unlike a rotary-wing aircraft, a fixed-wing aircraft is unable to hover. The landing of fixed-wing UAVs is further challenged [18] because even small errors in the guidance system may lead to system damage. Onboard vision has often been used to detect the runway [19] and to estimate the aircraft's pose relative to the runway for autonomous landing. However, runway detection is often sensitive to changes in runway appearance. More importantly, for a successful landing, the closer the UAV is to the ground, the higher the accuracy requirement of the UAV pose. Nevertheless, when the UAV approaches the ground, the limited onboard field of view makes it difficult to obtain comprehensive visual information on the runway, which affects the accuracy of pose estimation. Without runway detection, the landmarks on the platform are tracked by the aircraft to provide information on the relative poses [20]. For the runway or landmarks, achieving accurate, robust, and real-time detection often requires abundant computing and storage resources, which are often unattainable for small UAVs. In addition, although an onboard solution can directly provide the estimated pose to the control system without wireless communication support, it usually requires modification of the aircraft itself, which is not feasible in many application scenarios such as military fields.
Ground vision: An alternative to the onboard vision-based system is the ground vision-based guidance system. It estimates the UAV pose, and the pose data is then transmitted to the UAV for auto-landing control. Generally, ground systems are equipped with a vast computational capacity, which enables real-time pose estimation. In addition, using ground-based systems also reduces the load on onboard processing resources, which are often limited, especially for small aircraft. Furthermore, onboard sensors such as the barometer and inertial measurement unit (IMU) are significantly sensitive to temperature and magnetic field variations, which do not affect UAV pose estimation by the ground vision system. Y. Gui [21] proposed a relay guidance scheme to land a UAV by placing three groups of cameras on both sides of the runway so that the total field of view of the cameras covers the whole landing area. The AUTOLAND project [22] focused on solutions that enable the autonomous landing of a fixed-wing aircraft on a Fast Patrol Boat. The ground monocular vision system has been tested to generate the relative pose of the aircraft with respect to the camera. Relying on the ground stereo vision system, a saliency-inspired method [23] and a cascaded deep learning model [24] were proposed and developed to detect and track the aircraft in the images; both then used the Extended Kalman Filter (EKF) proposed in our previous work [6] to estimate the position of the aircraft. Focusing on ground stereo vision-based fixed-wing aircraft detection and localization for autonomous landing, we designed a ground stereo vision guidance system for validation. Unlike the multiple camera groups configuration in [21], a pan-tilt rotary system was built to extend the ground camera's field of view by controlling the rotation to track the landing aircraft. Offline and online experiments demonstrate the feasibility and robustness of the proposed system [6,25]. The above works, which focus only on the 3D position estimation of the aircraft, explored vision-based guidance system schemes including monocular vision and infrared stereo vision.
The integration of multiple sensors utilizes the advantages of different types of sensors, thereby improving the adaptability to the environment. T. Nguyen et al. [26] built a system combining an ultra-wideband (UWB) ranging sensor with a camera to localize the aircraft by using the distance and relative displacement measurements. A vision/radar/INS-integrated shipboard landing guidance system was developed in [20]. This system consisted of an onboard camera/INS-based motion estimator and an offboard radar-based relative position generator. X. Dong et al. [27] proposed an integrated UWB-IMU-Vision framework for autonomous approaching and landing of aircraft. Using simulated and real-world experiments in extensive scenes, the proposed scheme satisfied the accuracy requirement of auto-landing.
In addition to the position, attitude also plays an important role in the fixed-wing aircraft landing guidance and control system [28]. For accurate attitude estimation, the INS is commonly used [29]. Aided by the onboard inertial sensors, Yang et al. [29] proposed a bioinspired polarization-based attitude and heading reference system to self-determine the heading orientation in GNSS-denied environments. However, the INS measurement is often affected by internal or external factors such as drift, magnetic field, and temperature variation [30]. To improve robustness to environmental factors, vision systems are widely applied [10,31]. To realize the tanker-UAV relative pose estimation during aerial refueling, Mammarell et al. [31] combined GPS and a machine vision-based system for a reliable estimation, where at least one order of magnitude improvement was achieved by using the EKF instead of other fusion algorithms. Using an off-board camera, an EKF with a nonlinear constant-velocity process model was proposed to estimate position and attitude for rotary-wing aircraft [32].
In this research, taking inspiration from advancements in pose estimation techniques for humans [33], human heads [34], and rigid objects [35], we have developed a ground vision system-based fixed-wing aircraft 6D pose estimation method that is independent of the number of cameras and offers high accuracy, robustness, and real-time performance. Existing state-of-the-art methods, however, are primarily tailored for the pose estimation of slow-moving objects in a single-frame image. In a parallel vein, an attitude estimation method for fixed-wing aircraft was introduced in [28], which also leverages a ground vision system and validates its accuracy and real-time capabilities through experiments. In contrast to these approaches, our method not only estimates attitude but jointly determines the position and attitude of the aircraft. Furthermore, during our outdoor experiments, the aircraft accomplished successful autonomous landings in environments where GPS and INS data were unavailable, which shows that the onboard autonomous landing control system relied solely on our ground-based aircraft pose estimation.

Problem Formulation
Accurate pose estimation and flight control are the two main challenges for the autonomous landing of UAVs. This paper focuses on the first, namely estimating the UAV pose {P_u, A_u} with high accuracy and strong robustness from the ground sensor data. To guarantee that the UAV remains in the camera's field of view throughout the entire landing, the camera is mounted on a pan-tilt unit (PTU), which can automatically search for and track the UAV. Therefore, in addition to the images I, the PTU attitudes A_P are also included in the ground sensor data.
In general, the complete pose estimation procedure first obtains the object region of interest (ROI) by object detection, then detects the object's features, and finally estimates the object's pose. The proposed pose estimation algorithm can thus be summarized as a mapping from the images I, the PTU attitudes A_P, and the system parameters P_a, which are obtained by offline calibration, to the UAV pose {P_u, A_u}.
In computer vision, the commonly extracted features are points, lines, and planes. During auto-landing, there are several points of the UAV that are always in the field of view and remarkably distinctive. Since the anchors' distribution has a significant impact on the pose estimation accuracy, the selection of the anchors considers their characteristics and distribution (for details, see Section 5). Here, we consider the following five UAV anchors: the endpoints of the left wing (LW), right wing (RW), left tail (LT), right tail (RT), and the front tripod (FT), as shown in Figure 1.
As depicted in Figure 1, through offline training based on the training dataset D_T, the initial network Net evolves into the network Net_D, which gains the ability to detect the anchors. The operator FD(•) detects the anchors and obtains the anchor locations An in image I. This is mainly performed by the anchor detection network Net_D. The final operator FE(•) estimates the UAV pose {P_u, A_u} according to the anchor locations An, the system parameters P_a, and the real-time PTU attitude A_P.
The right part of Figure 1 displays the coordinate frames involved in the autonomous landing system. The objective of the proposed algorithm is to estimate the rapidly varying transformation between the world coordinate frame and the UAV body coordinate frame. For the left and right PTU coordinate frames, their origins in the world coordinate frame are known and constant. Since the camera is fixed on the PTU, the transformation between the camera coordinate frame and the corresponding PTU coordinate frame is also constant.
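The operator chain described in this section can be written compactly as follows. This is a sketch assembled from the operator descriptions in the text, not a reproduction of the paper's own equation: the argument order, the explicit composition, and writing the offline training step as an operator F_T are assumptions.

```latex
\begin{aligned}
Net_D &= F_T(Net,\; D_T), \\
An &= F_D(I;\; Net_D), \\
\{P_u, A_u\} &= F_E(An,\; P_a,\; A_P).
\end{aligned}
```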

Methodology
This part provides a detailed description of the three operators: the anchor detection operator FD(•), the CNN training operator FT(•), and the 6D pose estimation operator FE(•). The design of a Block-CNN for anchor detection is given first. Then, the details of training data generation and network training are presented. An EKF-based 6D pose estimation algorithm is finally described.

Anchor Detection Operator FD(•)
Anchor detection is one of the core operators of the proposed pose estimation algorithm, and its accuracy directly affects the pose estimation accuracy. Conventional anchor detection methods are often based on classical feature points such as Scale Invariant Feature Transform (SIFT) [36] and Oriented FAST and Rotated BRIEF (ORB) [37]. These handcrafted representations are, however, suboptimal compared to statistically learned features. A CNN learns the features and, therefore, achieves significant advantages in computer vision applications in terms of accuracy and robustness [38]. One of its disadvantages is that a CNN is computationally expensive and often needs GPU resources such as those supporting the Compute Unified Device Architecture (CUDA). This is an essential disadvantage that limits the applications of CNNs [39]. On the ground, however, such computational resources can easily be made available for vision-based applications. This justifies the use of a CNN for accurate and robust anchor detection in ground vision-based systems.
Figure 2 illustrates the designed CNN F, which has a 36 × 36 input and a 10 × 1 output vector. In general, the more convolutional layers a network has, the greater its accuracy. In practice, the trade-off between accuracy and real-time capability needs to be incorporated into the final design. Considering the above factors, we employed 4 convolution layers and 1 fully connected layer. The output vector represents the positions of the five anchors in the image. Conventionally, the anchors' positions are estimated by the network F and can be directly used for estimating the next pose. Here, to improve the detection accuracy, we introduced the block strategy. By partitioning the ROI, more pixel details are preserved after resizing to the network input size, which helps the network extract more useful features. In addition, the block strategy enables the repeated detection of the same anchors. Using score-weighted averaging on the repeated detection results, the negative impact of outliers on detection accuracy is also reduced.
Since each anchor is distributed in a relatively fixed region of the ROI, several blocks are obtained by dividing the ROI. Each block contains some of the five anchors. As shown in Figure 2, two blocks are cut from the original ROI and then resized to 36 × 36. The L-Block contains the anchors LW, LT, and FT, and the R-Block contains the anchors RW, RT, and FT. Another two networks (L and R) have almost the same structure as network F, and they are used to detect the anchors in the L-Block and R-Block, respectively. Compared with F, the only difference in the structure is that the outputs are six-dimensional vectors. To improve detection accuracy, the anchors' locations in the ROI are then obtained by computing the score-weighted average of the Block-CNN outputs.
Considering the movement continuity of the UAV, the location relationships among all of the anchors remain almost constant. For example, the tail anchors cannot move below the tripod anchor in the ROI. Therefore, we first set up constraints on the relative anchor locations, expressed in the image coordinates (u, v), whose upper and lower indices indicate the network and the anchor category, respectively. The outputs of the networks are ignored if they do not satisfy these constraints. For frame k, the final FT location (u_FT, v_FT) is the average of the outputs of all p networks; since three networks are used in this paper, p = 3. The other four anchor locations are first predicted according to the historical anchor locations at step k − 1. Based on these predictions, the final LW location is computed as the score-weighted average of the network outputs, and the remaining three anchor locations are computed following the same steps.
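The block strategy and the fusion of repeated detections can be illustrated with a short sketch. It is not the paper's implementation: the function names are hypothetical, the constraint check encodes only the tail/tripod example named in the text (assuming the image v axis grows downward), and the scoring rule (inverse distance to the location predicted from step k − 1) is an assumption, since the original score definition is not recoverable here.

```python
import numpy as np

def block_to_roi(uv_block, block_origin, block_size, net_input=36):
    """Map an anchor predicted in a resized net_input x net_input block
    back to ROI pixel coordinates.

    block_origin: (u, v) of the block's top-left corner in the ROI.
    block_size:   (width, height) of the block before resizing.
    """
    scale = np.asarray(block_size, dtype=float) / net_input
    return np.asarray(uv_block, dtype=float) * scale + np.asarray(block_origin, dtype=float)

def satisfies_constraints(anchors):
    """anchors: dict mapping anchor name to (u, v).

    Only the example from the text is encoded: the tail anchors must not
    lie below the front tripod (v grows downward, an assumption).
    """
    v_ft = anchors["FT"][1]
    return anchors["LT"][1] <= v_ft and anchors["RT"][1] <= v_ft

def fuse_anchor(candidates, predicted, eps=1e-6):
    """Score-weighted average of repeated detections of one anchor.

    Assumed score: inverse distance between each candidate and the
    location predicted from the previous frame, so outliers get little weight.
    """
    candidates = np.asarray(candidates, dtype=float)            # shape (m, 2)
    dist = np.linalg.norm(candidates - np.asarray(predicted, dtype=float), axis=1)
    scores = 1.0 / (dist + eps)
    weights = scores / scores.sum()
    return weights @ candidates
```

For instance, with two consistent detections near (11, 10) and one outlier at (50, 50), `fuse_anchor` returns a location close to (11, 10), whereas a plain mean would be pulled to u = 24.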

CNN Training Operator FT(•)
This module consists of training data generation and network training. Because outdoor landing experiments incur high labor and time costs, generating the data in a simulated system significantly improves data production efficiency. Considering how well Gazebo simulates UAV dynamic characteristics during landing, we constructed a Gazebo-based simulated environment following the same configuration as the outdoor environment. Using this method, data under different conditions, such as different wind directions and different landing paths, are generated efficiently. Furthermore, since the states of all objects, including the UAV, PTUs, cameras, and the parameters involved, are known accurately in the simulation, the data supports automatic labeling, and almost no manual labeling is needed.
Here, we define the training loss as the Euclidean distance between the estimation and the ground truth. Stochastic Gradient Descent (SGD) is employed as the optimizer. Before training, the samples were shuffled and every 10 samples were packed into a batch. We then trained the network on a PC with 2 GPUs (GTX 3070). The network was trained for a maximum of 40k iterations with an initial learning rate of 0.03, which was decreased by a factor of 10 every 15k iterations, and a weight decay of 0.0001.
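The training settings above can be made concrete with a small sketch. The function names are hypothetical, and the loss shown (mean per-anchor Euclidean distance over a batch) is one plausible reading of "Euclidean distance between the estimation and the ground truth", not necessarily the exact form used by the authors. The step schedule uses the stated values: initial rate 0.03, divided by 10 every 15k iterations.

```python
import numpy as np

def euclidean_loss(pred, target):
    """Mean per-anchor Euclidean distance over a batch.

    pred, target: array-likes of shape (batch, 2 * n_anchors),
    laid out as u1, v1, u2, v2, ...
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    diff = (pred - target).reshape(pred.shape[0], -1, 2)
    return np.linalg.norm(diff, axis=2).mean()

def learning_rate(iteration, base_lr=0.03, step=15_000, gamma=0.1):
    """Step decay: multiply the rate by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

def make_batches(n_samples, batch_size=10, seed=0):
    """Shuffle sample indices and pack them into batches of `batch_size`."""
    order = np.random.default_rng(seed).permutation(n_samples)
    return [order[i:i + batch_size] for i in range(0, n_samples, batch_size)]
```

Under this schedule, iterations 0 to 14,999 use a rate of 0.03, iterations 15,000 to 29,999 use 0.003, and the remaining iterations up to 40k use 0.0003.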

6D Pose Estimation Operator FE(•)
This module aims to recover the UAV spatial pose from several anchors. The Perspective-n-Point (PnP) solution for monocular vision and the triangulation method for stereo vision are commonly used for this problem. However, these methods neither consider the sensor data noise nor take advantage of the historical UAV states. To improve the robustness to measurement errors, such as the anchor detection error, we establish an EKF to estimate the UAV position (x, y, z) and attitude (Euler angles (ψ, φ, θ)) in the world coordinate frame.
Let the state x comprise the UAV position (x, y, z), the attitude (ψ, φ, θ), and the linear and angular velocities (v_x, v_y, v_z) and (w_ψ, w_φ, w_θ), respectively. The state x at step k is predicted by the process model F_k, in which ∆t_3×3 denotes the 3 × 3 diagonal matrix whose diagonal elements equal ∆t, the time interval between steps k − 1 and k. The state covariance matrix P is then propagated accordingly. The measurement z contains the detected anchor locations in the images captured by the N cameras, where N is the number of cameras and each camera provides a set of 3 × 5 measurements of the detected anchor locations. The upper index marks the camera (left or right in the stereo case). In the measurement model h(•), the pinhole camera model is employed to project the anchors into the images according to the predicted state x_k|k−1, up to a scaling factor λ, where P_U is the matrix of anchor locations in the UAV body coordinate frame. The matrix T is the homogeneous transformation between the two coordinate frames, composed of the rotation matrix R and the translation vector t, and N_c is the intrinsic matrix of the camera, built from the camera-intrinsic parameters f, u_0, v_0, dx, and dy, which can be obtained in advance through offline calibration.
Because the measurement model h(•) is nonlinear, it is linearized through its Jacobian matrix H(•). The Kalman gain K_k is then computed from H, the predicted state covariance, and G, the sensor's Gaussian noise covariance matrix for each measurement. The final step is to update the state x.
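The prediction and update steps can be sketched in NumPy as below. This is an illustrative constant-velocity EKF under stated assumptions, not the authors' implementation. The state ordering [position, attitude, linear velocity, angular velocity], the Z-Y-X Euler convention, the process noise Q, the finite-difference Jacobian, and the use of two pixel coordinates (u, v) per anchor are all assumptions. The world-to-camera transform T_cw of each camera is taken as given; in the described system it would follow from the calibrated parameters and the PTU attitude. The intrinsic matrix uses the standard pinhole form with f/dx and f/dy on the diagonal.

```python
import numpy as np

def euler_to_rot(psi, phi, theta):
    """Body-to-world rotation from yaw psi, roll phi, pitch theta (Z-Y-X order, assumed)."""
    cz, sz = np.cos(psi), np.sin(psi)
    cy, sy = np.cos(theta), np.sin(theta)
    cx, sx = np.cos(phi), np.sin(phi)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def intrinsics(f, dx, dy, u0, v0):
    """Standard pinhole intrinsic matrix N_c."""
    return np.array([[f / dx, 0.0, u0], [0.0, f / dy, v0], [0.0, 0.0, 1.0]])

def h(x, anchors_body, cams):
    """Project the body-frame anchors (3 x n) into every camera.

    cams: list of (N_c, T_cw) pairs, one per camera, so N is len(cams).
    Returns the stacked pixel coordinates u1, v1, u2, v2, ... of all cameras.
    """
    pts_w = euler_to_rot(*x[3:6]) @ anchors_body + x[0:3][:, None]
    z = []
    for N_c, T_cw in cams:
        pts_c = T_cw[:3, :3] @ pts_w + T_cw[:3, 3:4]   # world -> camera frame
        uvw = N_c @ pts_c                               # third row is the scale factor
        z.append((uvw[:2] / uvw[2]).T.ravel())
    return np.concatenate(z)

def ekf_step(x, P, z, dt, Q, G, anchors_body, cams, eps=1e-6):
    """One EKF cycle: constant-velocity prediction followed by the measurement update."""
    F = np.eye(12)
    F[:6, 6:] = dt * np.eye(6)                          # pose += velocity * dt
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q                            # Q is an assumed process noise
    z_pred = h(x_pred, anchors_body, cams)
    H = np.zeros((z_pred.size, 12))                     # finite-difference Jacobian of h
    for i in range(12):
        d = np.zeros(12)
        d[i] = eps
        H[:, i] = (h(x_pred + d, anchors_body, cams) - z_pred) / eps
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + G)
    x_new = x_pred + K @ (z - z_pred)
    P_new = (np.eye(12) - K @ H) @ P_pred
    return x_new, P_new
```

Because `cams` is a plain list, the same code covers the monocular case (one entry) and the stereo case (two entries); only the length of z and the size of G change.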

Experiments
To validate the proposed algorithm and generate a training dataset, we built a parallel system, as shown in Figure 3. This system comprised the guidance system and the fixed-wing UAV. The guidance system included two two-degree-of-freedom pan-tilt units (PTUs), two cameras mounted on the PTUs, and a laptop with an i9-9900K CPU and an NVIDIA GTX2080 GPU. In the outdoor experiment, the PTUs were placed on both sides of the runway with a 10.77 m baseline. The PTU attitude measurement resolution was 0.00625 degrees, and its maximum rotation speed was 50 degrees/s. The DFK 23G445 camera (Germany) rotated with the PTU to extend the field of view and produced 640 × 480 pixel video at 60 frames per second. The fixed-wing UAV Pioneer had a wingspan of 2.3 m, a total mass of 14 kg, and petrol propulsion. The simulation followed the same configuration as the outdoor environment, including the UAV model, PTU, and camera parameters. In addition, the real UAV autopilot PX4 was introduced to establish a hardware-in-the-loop simulated system and to realize a more realistic flight.

Online experiments were conducted in both simulated and outdoor environments to validate the performance. A complete landing comprises the sloping, flaring, and taxiing phases, which place different requirements on pose estimation accuracy. Therefore, the pose estimation performance was analyzed for all three phases. For autonomous ROI extraction, YOLO-v4 [40] was employed to provide the ROI of the UAVs. The dataset used for offline anchor detection and ROI extraction training was generated from 5 simulated (1610 frames) and 2 outdoor (730 frames) landings. Experimental results show that YOLO-v4 achieved 97.2% UAV ROI detection accuracy in our landing scenes. Almost all of the misdetections occurred in the taxiing period, since the background on the ground was significantly more complex than the sky.

Anchor Detection
To validate the effect of training data augmentation on anchor detection accuracy, three different training datasets were used to train the networks: the real dataset (RD), the simulated dataset (SD), and the mixed dataset (MD) combining the RD and SD. We implemented and evaluated two classic anchor detection methods as baselines for the anchor detection experiments: a conventional network (only the F part shown in Figure 2) and KeyPose [41], which localizes the anchors by predicting heatmaps. As shown in Table 1, the three methods were trained using the three datasets, and nine networks with different parameters were obtained accordingly. The anchor detection error e is defined as:

$$e = \frac{\sqrt{(u - \hat{u})^{2} + (v - \hat{v})^{2}}}{w} \times 100\%$$

where (u, v) and (û, v̂) are the detection and ground truth of the anchor locations in the image frames, respectively. As shown in Figure 4, the numerator is the pixel distance between the detection and ground truth of the anchor, and w is the width of the ROI. For each anchor, a detection with e > 5% is regarded as a failure. For a complete test, assuming that the total number of anchors is M and the number of failures is m, the failure rate f is:

$$f = \frac{m}{M} \times 100\%$$

Table 1 presents the detection failure rates of the five UAV anchors. Compared with the networks trained on only the real or simulated dataset, the networks trained on the mixed dataset improved, or at least maintained, detection accuracy for the conventional network (failure rate from 4.0% to 2.1% and 7.2% to 6.7%), KeyPose (from 3.1% to 1.9% and 5.5% to 5.5%), and the Block-CNN (from 0.5% to 0.3% and 2.9% to 2.3%). Using the training dataset MD, the Block-CNN reached detection precision rates of 99.7% and 97.7% for the simulated and outdoor tests, respectively. In addition, when using the same training dataset, the Block-CNN achieved 3.5% and 2.5% precision improvement compared with the two benchmarks, respectively.
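The error and failure-rate definitions above can be sketched in a few lines of Python. This is an illustrative reading of the text, assuming e is the pixel distance normalized by the ROI width and that the 5% threshold is strict; the function names and sample coordinates are hypothetical.

```python
import math

def anchor_error(detected, truth, roi_width):
    """Pixel distance between detection and ground truth, as a fraction of ROI width."""
    (u, v), (u_gt, v_gt) = detected, truth
    return math.hypot(u - u_gt, v - v_gt) / roi_width

def failure_rate(detections, truths, roi_widths, threshold=0.05):
    """Fraction m / M of anchors whose normalized error exceeds the threshold."""
    failures = sum(
        anchor_error(d, t, w) > threshold
        for d, t, w in zip(detections, truths, roi_widths)
    )
    return failures / len(detections)

if __name__ == "__main__":
    dets = [(10, 10), (20, 20), (50, 50), (0, 0)]
    gts = [(10, 12), (20, 20), (60, 50), (3, 3)]
    print(failure_rate(dets, gts, [100] * 4))
```

The detection precision quoted in the text corresponds to 1 − f under this reading.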
Figure 4 also displays several detection samples in different landing phases using network F. In the images captured in the simulated and outdoor environments, the UAV showed more distinguishable features in the simulated cases. This indicates that the dataset acquired from the simulation was more efficient than that from the outdoor environment, because the experimental conditions in the simulated environment were more controllable. Therefore, generating simulated data not only expanded the dataset but also improved the dataset quality, hence contributing to the improvement of the anchor detection accuracy.
In summary, the experimental results suggest that training dataset augmentation contributes to anchor detection improvement. Furthermore, compared with the two baselines, the proposed Block-CNN achieves a significant improvement in the accuracy of anchor detection in UAV landing scenes.

Simulations
Here, we evaluate the pose estimation performance of the proposed algorithm and compare it with the PNP solution in the simulated environment. Three methods were tested on the simulated guidance system: the monocular PNP (MP), the monocular EKF (ME), and the stereo EKF (SE) proposed in this paper. Since MP and ME only need monocular vision, we took the average results of the two cameras as their final results.
We designed three simulated landing scenes (S1, S2, S3) to validate the performance of the proposed algorithm. For each landing scene, three different landing trajectories were designed, which means that a total of 9 simulated experiments were conducted and discussed. The first landing was completed without wind disturbance. To simulate the crosswind in the outdoor environment, a continuous crosswind was created during the second and third landings. Additionally, in the third landing, a go-around, which is common during actual landings, was also simulated. The estimated pose error in the sloping, flaring, and taxiing phases is shown in Figure 5. Each group of error graphs represents the average error of three flight experiments in the same scene. Since the UAV gradually approached the guidance system, the position estimation error of the ME and MP was gradually reduced from the sloping to the taxiing phase. In other words, the ME and MP are sensitive to the distance between the UAV and the camera. In contrast, the distance hardly affects the positioning error of the proposed SE. The MP showed a remarkable positioning error exceeding 20 m (S2 at the X- and Y-axes), which is likely to result in deviation from the runway. In the flaring period, the estimation of the height above the ground (Z-axis) is also important. Except for the SE, the methods had significant errors along the Z-axis.
The attitude estimation error showed greater volatility than the positioning error. The largest pitch estimation error reached 10° in the sloping phase of S2. Such an error would probably cause the UAV to descend too fast to land successfully. A high initial yaw estimation error was also seen in S2. However, the errors of SE and ME gradually converged before entering the flaring period, whereas the error of MP remained volatile.
Figure 6 displays the RMSE for the three landing scenes (each scene contains three flights with different trajectories). For the conventional method MP, the positioning RMSE reached 18.41 m at the X-axis and 3.67 m at the Z-axis, an unacceptable level of accuracy for the landing process. In addition, the yaw RMSEs exceeding 6 degrees in simulations S2 and S3 cannot support a successful landing. By comparison, ME improved the accuracy of both the position and attitude estimation. Furthermore, a remarkable pose RMSE reduction was achieved by the SE. The RMSEs at the X- and Y-axes did not exceed 1.0 m, which ensured that the UAV was completely within the runway range. More importantly, the RMSE at the Z-axis did not exceed 0.1 m, which laid a key foundation for a successful landing.
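As a reminder of how the per-axis RMSE values reported here are obtained, the following Python sketch computes the RMSE of each pose component over a sequence of frames. The function name and the sample numbers are hypothetical and are not taken from the experiments.

```python
import math

def rmse_per_axis(estimates, ground_truth):
    """RMSE of each component (e.g., X, Y, Z or roll, pitch, yaw) over all frames."""
    n = len(estimates)
    dims = len(estimates[0])
    return [
        math.sqrt(sum((e[d] - g[d]) ** 2 for e, g in zip(estimates, ground_truth)) / n)
        for d in range(dims)
    ]

if __name__ == "__main__":
    est = [(1.0, 2.0, 3.0), (2.0, 2.0, 5.0)]   # estimated positions per frame
    gt = [(0.0, 2.0, 3.0), (2.0, 2.0, 3.0)]    # ground-truth positions per frame
    print(rmse_per_axis(est, gt))
```

Averaging such values over the three flights of a scene would give one bar per axis and method, in the manner of Figure 6.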
In addition to the position estimation, the estimation of yaw was less accurate than that of roll and pitch, but the yaw RMSE of SE still did not exceed 2.01°. In terms of real-time capability, the frame rate of the final pose estimation was almost the same as that of the anchor detection module (>30 fps), since the EKF estimator consumed much less time than the networks. The anchors' configuration, such as their spatial distribution and number, is critical for pose estimation accuracy. Generally speaking, the more anchors are detected, the more information they provide and the lower the sensitivity to the anchor detection error. Therefore, the maximum possible number of anchors should be configured. In our algorithm, 5 anchors were configured for the UAV.
To explore the influence of the anchors' configuration on position and attitude estimation accuracy, two further configurations were evaluated. As shown in Figure 7, 3 anchors and 8 anchors were configured, and the results in simulations S2 and S3 are listed in Table 2. According to the RMSE of the position and attitude, the configuration with 3 anchors was drastically less accurate in both position and attitude estimation than the case with 5 anchors. A safe auto-landing is also impossible with position RMSEs of 5.03 m and 3.38 m at the Z-axis. In other words, the 3-anchor configuration is infeasible. For the 8-anchor configuration, the 3 additional anchors did not bring a remarkable enhancement in estimation accuracy. In summary, for the UAV landing application, the experimental results indicate the superiority of the 5-anchor configuration employed in our algorithm.

Outdoor Evaluation
We implemented and evaluated KeyPose combined with classical triangulation (KC) [41] and object triangulation (KO) [41], respectively, as the baselines of the 6D pose estimation. The ground truth of the UAV pose was obtained by the onboard synchronous D-GPS and inertial navigation system (INS).
For auto-landing, the estimation accuracy at the Z-axis is the most important in terms of position estimation, especially in the flaring period. Once the UAV is taxiing on the runway, which is usually detected by the landing detector, the UAV controller no longer considers the location at the Z-axis. In our outdoor experiments, the runway width was about 10 m. This means that the maximum error at the X-axis in the flaring and taxiing phases should not exceed 5 m.
Three outdoor experiments were conducted for performance evaluation. Since the computer platform on which the algorithms ran was the same as the one used in simulation, the real-time capability in simulation (30 fps) was also valid for the outdoor evaluation. Figure 8 illustrates the position and attitude estimation results of one of the three outdoor experiments and the pose estimation error. For a successful landing, smooth temporal positioning is a prerequisite. According to the position estimation results, the error curves of KC and KO fluctuated more than those of the proposed method SE. For instance, the maximum errors of KC and KO reached 132 m at the Y-axis and 14 m at the X-axis. Such errors can easily cause the UAV to deviate from the 10 m wide runway. In contrast, the maximum error of SE at the X-axis did not exceed 5 m, which was necessary to ensure that the aircraft was always within the range of the runway in the flaring and taxiing phases. For the algorithms KC and KO, the error at the Y-axis was higher than that at the other axes. This is because the Y-axis largely coincides with the direction of the cameras' optical axes, which makes the positioning at the Y-axis more sensitive to measurement errors. The details have been discussed in our previous work [6]. For the attitude estimation, the results of the three methods showed comparable temporal fluctuation. The images and detected anchors from the two cameras are also shown at points A, B, and C. The red and yellow anchors are the ground truth and the detection results, respectively. At point C, an anchor was out of the field of view of the left camera, which caused a remarkable Z-axis positioning error.

Table 3 shows the RMSE of the three outdoor experiments in the sloping, flaring, and taxiing phases, respectively. The RMSE in the sloping phase is higher than that in the other two landing phases. This is because the UAV was the farthest away from the cameras in the sloping phase, and our previous work [6] has demonstrated that, for a given detection error, the further the UAV is from the cameras, the greater the location error. In addition, the roll and yaw RMSEs of SE in the sloping phase reached 21.0°, 19.3°, and 17.8° and 11.5°, 10.3°, and 11.8°, respectively, which were mostly higher than those of the triangulation solution (16.2°, 18.5°, and 19.9° and 7.8°, 7.6°, and 8.5°). The reason is that the initial UAV pose of the proposed method SE was only roughly estimated, and the filter needed several steps to converge. The RMSEs of the height estimation (at the Z-axis) in the flaring period by the triangulation solution reached 1.1 m, 1.4 m, and 1.6 m, which would lead to a failed landing. In contrast, the SE results did not exceed 0.5 m, thus satisfying the safe landing requirement.

In summary, the outdoor experimental results show that, compared with the two benchmarks KC and KO, the proposed SE achieved 19.1% and 19.9% (average 19.5%) accuracy improvement in position, and 12.3% and 13.0% (average 12.7%) accuracy improvement in attitude, respectively. The outdoor experiments also confirmed the feasibility of the proposed solution SE to provide the real-time poses of the UAV during autonomous landing.
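The dependence of the location error on distance noted above can be illustrated with a simplified Python sketch. It assumes an idealized rectified stereo pair with parallel optical axes, which is not the PTU-based geometry of the guidance system, and uses the standard first-order relation δZ ≈ Z² · δd / (f · b) for the depth error along the optical axis. The 10.77 m baseline comes from the text; the focal length of 1000 pixels and the 1-pixel disparity error are hypothetical.

```python
def depth_error(distance_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth error of parallel-axis stereo for a given disparity error."""
    return distance_m ** 2 * disparity_error_px / (focal_px * baseline_m)

if __name__ == "__main__":
    for distance in (50.0, 100.0, 200.0):
        err = depth_error(distance, focal_px=1000.0, baseline_m=10.77,
                          disparity_error_px=1.0)
        print(f"distance {distance:6.1f} m -> depth error {err:.2f} m")
```

Under these assumptions, doubling the distance quadruples the depth error for the same pixel error, which is in line with the higher RMSE observed in the sloping phase and with the larger errors along the axis closest to the optical axes.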

Conclusions
We have introduced a UAV pose joint estimation algorithm for autonomous guidance, enabled by an N-cameras (N ≥ 1) ground vision system. This approach involves constructing a pipeline that combines CNN-based anchor detection and anchors-driven pose estimation. To enhance anchor detection, we have introduced a Block-CNN learning-based detection algorithm, leveraging a blocking mechanism that significantly improves accuracy and robustness. Given the high cost associated with outdoor experiments, we have devised a parallel system that includes both simulated and outdoor environments, sharing the same configuration, to augment the training dataset via simulated experiments. The actual pose estimation is achieved through an EKF estimator that uses the detected anchor locations. Our simulations and experiments show that our method achieves 3.0% anchor detection precision improvement and 19.5% and 12.7% accuracy improvement of position and attitude estimation, compared with other state-of-the-art methods. Furthermore, these experiments affirm that our algorithm satisfies the stringent landing navigation requirements in terms of accuracy, real-time capability, and robustness.

An essential prerequisite for continuous UAV pose estimation is maintaining the UAV within the field of view of the ground camera. Hence, accurate and stable servo tracking of the pan-tilt units is crucial for successful auto-landing. However, in our outdoor experiments, we occasionally encountered failed servo tracking, leading to experimental failures. To enhance the stability of our ground vision system, we will focus on addressing the servo tracking problem in our future work. Additionally, our future endeavors will encompass exploring special scenarios, including varying weather conditions and complex backgrounds.

Figure 1. The proposed algorithm is composed of the training operators FT(·), the anchor detection operator FD(·), and the pose estimation operator FE(·). The right trajectory plot shows the involved coordinate frames and the three main periods of the UAV auto-landing.

Figure 2. The Block-CNN architecture for anchor detection. The ROI is cut into 3 sub-ROIs. The anchors are then detected in the sub-ROIs using networks with the same structure. Each network consists of 4 convolutional layers and 1 fully connected layer. The final detection results are generated after the fusion.

Figure 3. The parallel system contains the outdoor and Gazebo-based simulated environments. It is composed of the guidance system and the fixed-wing UAV, and the outdoor and simulated environments have the same configuration.

Figure 4. Detection samples in different landing phases using network F. Once the pixel distance between the detection and ground truth of the anchor exceeds 5% × w (ROI width), the detection is regarded as a failure.

Figure 5. The landing trajectories in the simulated environment and the pose estimation error for the three simulated landing scenes. Each landing scene contains three different landing trajectories for simulated validation.

Figure 6. The RMSE of pose estimation for the three simulated experiments.

Figure 7. Two alternative anchor configurations. The five-anchor configuration is employed in our algorithm.

Figure 8. The trajectory estimation results of one of the outdoor experiments and the pose estimation error by KC, KO, and SE. Estimated trajectories in the sloping, flaring, and taxiing phases are depicted. In addition, several exemplary images captured by the cameras during auto-landing and the anchor detection results are also shown.

Table 1. The failure rates of anchor detection (defined as Formula (27)) using 3 different training datasets.

Table 2. The RMSE in the SE for three-, five-, and eight-anchor cases.

Table 3. The RMSE in outdoor performance evaluations.