Autonomous Target Tracking of UAV Using High-Speed Visual Feedback

Abstract: Most current unmanned aerial vehicles (UAVs) primarily use a global positioning system (GPS) and an inertial measurement unit (IMU) for position estimation. However, compared to birds and insects, the abilities of current UAVs to recognize the environment are not sufficient. To achieve autonomous flight of UAVs, like birds, the UAVs should be able to process and respond to information from their surrounding environment immediately. Therefore, in this paper, we propose a direct visual servoing system for UAVs, using an onboard high-speed monocular camera. There are two advantages of this system. First, the high image sampling rates help to improve the ability to recognize the environment. Second, the issue of control latency can be effectively solved because the position control signals are transmitted to the flight controller directly. In the experiment, the UAV could recognize a target at update rates of about 350 Hz, and a target tracking task was successfully realized.


Introduction
Unmanned aerial vehicles (UAVs), often called drones or multicopters, have become a popular research topic because of their high autonomy and flexibility. To achieve autonomous flight of a UAV, it is important to obtain its position and attitude in real time. In most current UAVs, an inertial measurement unit (IMU) is used to acquire the attitude, and the Global Positioning System (GPS) is used to acquire the position. These signals are then fused by a Kalman filter. However, GPS is not suitable in indoor environments, and even outdoors it is unreliable when there are many obstacles. When the GPS signal is blocked, the UAV easily loses its locational information. On the other hand, flying creatures such as birds and insects rely mainly on visual information for their flight control [1]. In particular, in environments with many obstacles, visual information plays an important role in collision avoidance [2,3]. To realize completely autonomous flight of a UAV in any environment, a robust and stable visual feedback flight control method is indispensable.
In conventional research on vision systems for UAVs, visual simultaneous localization and mapping (V-SLAM) and visual odometry (VO) for vision-based navigation are the main topics. The V-SLAM technique constructs a live map of the environment the UAV has passed through, and the UAV localizes itself relative to this constructed map [4,5]. VO estimates the egomotion of a UAV using successively captured images instead of constructing a complete environmental map [6]. These studies focused on the recognition of a 3D environment, and the UAVs were controlled based on the recognition result. However, the computational cost of V-SLAM is high, and its processing speed is limited.
On the other hand, in flying organisms, real-time visual information is used to avoid collisions and fly in narrow spaces. In such cases, it is not necessary to obtain a detailed 3D environment map in world coordinates, and a UAV can fly by visual servoing even if only the relative position and orientation between the UAV's body and the obstacle are obtained. The process of accurate 3D environment recognition is omitted, enabling responsive and quick flight. We think that the performance of such responsive visual servoing in conventional UAV research is still not high enough. There have been studies using simple optical flow sensors [7,8], ultrasonic sensors [9,10], and laser imaging detection and ranging (LIDAR) sensors for self-navigation [11,12]. However, there are not many examples using visual servoing control for UAVs. This is due to the low frame rate of conventional vision systems and the heavy computational cost of visual processing.
Our group has developed various types of high-speed vision systems with processing speeds of around 500 Hz to 1000 Hz [13]. In this paper, we developed a new lightweight high-speed vision system for UAVs and a visual servo control scheme for quick target tracking. The proposed vision system can recognize the surrounding environment in a short time and immediately transmit control signals to the flight controller. Figure 1 shows the difference between the relative-position-based visual servoing used in this paper and the conventional approach. Relative-position-based visual servoing estimates the relative position and orientation to the target for visual servoing, instead of collecting complete environmental information for making a control strategy. This paper is organized as follows: In Section 2, research on UAV control using visual information is described. In Section 3, we introduce the system configuration of our UAV and the moving target used in the experiment. In Section 4, we explain the image processing for estimating the relative pose between a target and the UAV. In Section 5, we introduce the principle of relative-position-based visual servoing and the controller design for a position hold task. In Section 6, we first explain the experimental method and present the experimental results of the position hold task and target tracking in real flight, and then discuss the results. Finally, we conclude our work in Section 7.

Related Works
A number of studies on controlling UAVs using visual sensors have been conducted in recent years [14]. In particular, studies on visual servoing for UAVs are reported in [15]. Watanabe et al. proposed a multi-camera visual servo system for controlling unmanned micro helicopters by using visual information [16]. Yu et al. presented a visual control method for a small UAV, in which binocular stereo vision and an ultrasonic system are fused to implement an obstacle avoidance strategy [17]. Guenard et al. proposed an eye-in-hand visual servo control system for stationary flight of a UAV [18]. Teuliere et al. utilized a color-based object recognition method to chase a moving target from a flying UAV [19]. Eberli et al. proposed a pose estimation algorithm using a single circular landmark for controlling a quadrotor [20]. Jung et al. presented a visual navigation strategy that enables a UAV to fly through various obstacles in complex environments [21]. Lee et al. proposed a method of autonomous landing of a VTOL UAV on a moving platform using image-based visual servoing (IBVS) instead of calculating the position of the target [22]. Jabbari et al. proposed an adaptive dynamic IBVS scheme for underactuated UAV control [23]. Araar et al. proposed and compared two visual servoing schemes based on Hough parameters to track linear structured infrastructures [24]. Falanga presented a quadrotor system capable of autonomously landing on a moving platform using an onboard vision sensor for localization and motion estimation of the moving platform [25]. Thomas et al. proposed vision-based localization and servoing for a quadrotor to realize autonomous dynamic grasping and perching [26,27]. In addition to these studies, a number of robust visual servo controller designs for improving the flight performance of UAVs have also been presented [28][29][30][31][32].
In previous studies, the highest reported onboard processing rate for the control strategy was 75 Hz, which is not high enough to realize high-speed autonomous flight of a UAV. To achieve this goal, it is necessary to integrate a high-speed vision system into the UAV. A high-speed vision system is effective for controlling a robot at high speed, and various applications of high-speed vision have been developed [33][34][35][36]. Moreover, a high-speed tracking system [37] and a high-speed velocity estimator for UAVs [38] have also been proposed.
Considering the merit of the high noise tolerance of high-speed vision systems and the possibility for high-speed control, in this paper we propose a relative-position-based visual servoing method using an onboard high-speed camera to realize an autonomous target tracking task.

Developed UAV
Figure 2a shows the platform we used in the experiment. For fast motion, we selected a QAV250 fiber frame as the flying platform. For high-speed target recognition, we chose a XIMEA MQ003CG-CM high-speed camera with a Theia SY110M ultra-wide lens. Considering the need for fast image processing and pose estimation, as the onboard companion computer we chose a Jetson TX2, a low-power embedded platform launched by NVIDIA that includes a 256-core NVIDIA Pascal GPU, a hex-core ARMv8 64-bit CPU complex, and 8 GB of LPDDR4 memory with a 128-bit interface. As the flight controller, we chose a Pixhawk 2.1 Cube as the basic hardware for developing our own flight controller. Figure 2b shows the wiring diagram of our UAV. The high-speed camera transmits raw images to the companion computer via a USB 3.0 cable. After the relative pose is calculated, the current position and setpoint are transmitted to the flight controller using the MAVLink protocol via an FTDI USB-to-TTL cable. Finally, PPM signals are sent to the ESCs, which convert them to 3-phase signals to drive the brushless motors. The specifications of our platform are shown in Table 1.
Moreover, the GPU-based Harris corner algorithm was used for detecting features. This algorithm was built on VisionWorks [39], which is a CUDA software development package for computer vision and image processing.

Moving Target
In the experiment, we prepared a sample moving target as a reference for visual servoing to verify the performance quantitatively. For convenience of fast feature detection in the image processing, we attached four LED lights to a plate to serve as the moving target, as shown in Figure 3a. We define the coordinates of the LEDs as (0, 0, 0), (0.21, 0, 0), (0, 0.15, 0), and (0.21, 0.15, 0) [m] for points 1–4, respectively. To move the target plate, we used a Barrett WAM 7-DoF robot manipulator, as shown in Figure 3b, and moved it with different speeds and accelerations [40].

Vision-Based State Estimator
In this section, we explain the vision-based estimator of the pose of the UAV. This estimator outputs the relative position and orientation with respect to the target. By exploiting the property of high-speed vision that objects appear nearly stationary between frames, simple high-speed image processing and state estimation are realized. The result is used in the controller explained in the next section.

Feature Detection
We use the Harris corner algorithm [41] for detecting features on the target. This is a general-purpose feature detection algorithm that can be easily applied to other recognition tasks. In addition, it is suitable for high-speed image processing because of its low computational load. The detection capability of the Harris corner algorithm is weaker than that of more advanced feature detectors such as SURF [42] and FAST [43]. However, with high-speed vision, only slight changes of the target are observed between frames, and temporal continuity can be assumed, so false detections are easy to prevent.
If the following score R is larger than a threshold value at an image point u(x, y), we define it as a corner:

R = det M − k (trace M)²  (1)

where M = Σ_{x,y} w(x, y) [I_x², I_x I_y; I_x I_y, I_y²] is the structure matrix, I_x and I_y are image derivatives in the x and y directions, respectively, w is a window function that weights the pixels underneath, and k is a constant for calculating R.
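As a concrete sketch of Equation (1), the Harris response can be computed with plain NumPy; the window radius and k = 0.04 below are illustrative choices, not the parameters used on the UAV:

```python
import numpy as np

def harris_response(img, k=0.04, r=1):
    """Harris response R = det(M) - k * (trace M)^2, where M sums the gradient
    products I_x^2, I_x I_y, I_y^2 over a (2r+1) x (2r+1) box window."""
    img = img.astype(np.float64)
    Iy, Ix = np.gradient(img)              # image derivatives I_y, I_x
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    h, w = img.shape
    R = np.zeros((h, w))
    for y in range(r, h - r):
        for x in range(r, w - r):
            sxx = Ixx[y - r:y + r + 1, x - r:x + r + 1].sum()
            syy = Iyy[y - r:y + r + 1, x - r:x + r + 1].sum()
            sxy = Ixy[y - r:y + r + 1, x - r:x + r + 1].sum()
            R[y, x] = (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2
    return R

# A bright square has four corners; R should peak near them and stay
# non-positive along the edges, where only one gradient direction exists.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
corners = np.argwhere(R > 0.5 * R.max())
```

On the UAV this computation runs on the GPU via VisionWorks; the explicit loop version here is only for clarity.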

Image Centroid
At time t, we classify the detected features into four sections by using the center C of all image features, as shown in Figure 4. The four feature points are then computed as the centroids of the features in each section by employing the image moments [44]; for each section, the centroid is (m10/m00, m01/m00), where m00, m10, and m01 are the zeroth- and first-order moments of the features in that section. Given these four image points, a known camera matrix, and the known 3D positions of the LEDs on the target, we can derive the camera pose O T by employing the Perspective-n-Point (PnP) algorithm [45]. The general formulation of PnP is to find the homogeneous transformation matrix O T that minimizes the image reprojection error:

O T* = argmin Σ_n || p_n − p̂_n(O T) ||²

where p̂_n is the reprojection of the 3D point O X_n into the image according to the transformation O T.
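A minimal sketch of the centroid computation via zeroth- and first-order image moments; the feature coordinates here are made up for illustration:

```python
import numpy as np

def centroid(points):
    """Centroid from image moments: m00 = number of features,
    m10 = sum of x, m01 = sum of y; centroid = (m10/m00, m01/m00)."""
    pts = np.asarray(points, dtype=np.float64)
    m00 = len(pts)
    m10 = pts[:, 0].sum()
    m01 = pts[:, 1].sum()
    return np.array([m10 / m00, m01 / m00])

# Hypothetical Harris features clustered around one LED (one image section);
# their centroid is one of the four points passed to PnP.
feats = [(100, 101), (102, 99), (101, 100), (99, 100)]
c = centroid(feats)  # -> [100.5, 100.0]
```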
By using iterative Levenberg-Marquardt optimization [46], we find the pose that minimizes the reprojection error of these four points on the target. The pose represents the relative position O t and attitude O R between the target's coordinate system and the camera's coordinate system.
Finally, since we use the relative position directly for the UAV's position control, we applied a smoothing filter to the estimated position O t:

O t̄_i = (1/m) Σ_{j=0}^{m−1} O t_{i−j}

where i is the time step and m is the order of smoothness. In the experiment, we set m to 7. On the other hand, the estimated orientation is transformed to a quaternion and fused with the IMU by the extended Kalman filter (EKF) algorithm for the attitude controller. Figure 5 shows the principle of the PnP algorithm. Figure 6 shows a flowchart of the high-speed target recognition method in our system.
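Assuming the smoothing filter is an m-point moving average of the estimated positions (consistent with m = 7 being called the order of smoothness), a minimal sketch:

```python
import numpy as np

M = 7  # order of smoothness used in the experiment

def smooth(history):
    """Moving average of the last M estimated positions O t_i (newest last)."""
    h = np.asarray(history[-M:], dtype=np.float64)
    return h.mean(axis=0)

# Noisy x-positions alternating around 0.6 m: the filter suppresses jitter.
hist = [[0.6 + 0.01 * (-1) ** i, 0.1, 0.1] for i in range(10)]
t_smoothed = smooth(hist)
```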

Relative-Position-Based Visual Servoing
To realize responsive, quick motion, we propose relative-position-based visual servoing, which calculates only the pose of the UAV relative to the target. Since only relative positions are necessary for our controller, the computational load of image processing on the onboard companion computer is effectively decreased, and more visual information can be obtained in the same time. Figure 7 shows the relative-position-based visual servoing method we used in the experiment.

Coordinate Transformation
There are three coordinate systems in our system: the object coordinate system O, the camera coordinate system C, and the UAV's body coordinate system B, as shown in Figure 8a. We can obtain the relative position O x and orientation O R directly by PnP.
Considering the possibility of electromagnetic interference, we estimate the relative attitude with the vision estimator instead of a magnetometer. The relative yaw angle O ψ shown in Figure 8b is calculated by decomposing the relative rotation matrix derived from the PnP algorithm mentioned in the previous section. For the attitude control, the relative yaw angle estimate is used as an observation and is fused with the IMU in the EKF state estimator.
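Assuming a ZYX Euler decomposition, the relative yaw can be extracted from the rotation matrix as ψ = atan2(R21, R11); a minimal sketch:

```python
import numpy as np

def yaw_from_rotation(R):
    """Relative yaw under a ZYX Euler decomposition: psi = atan2(R[1,0], R[0,0])."""
    return np.arctan2(R[1, 0], R[0, 0])

# A pure rotation about z by 30 degrees should decompose back to 30 degrees.
psi = np.deg2rad(30.0)
Rz = np.array([[np.cos(psi), -np.sin(psi), 0.0],
               [np.sin(psi),  np.cos(psi), 0.0],
               [0.0,          0.0,         1.0]])
```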

Position Hold Controller Design
We built our visual servoing controller based on the PX4 flight controller architecture [47]. Figure 9 shows the control diagram for the position hold flight. To achieve position hold control, first the current vehicle states O x and O ψ are estimated by the vision-based state estimator. Then the position controller computes the attitude setpoint B q_sp and thrust setpoint B t_sp directly from the estimated relative position, as follows.

The velocity setpoint O v_sp is calculated by

O v_sp = K_px (O x_sp − O x)

where K_px is the position proportional gain, and O x_sp and O x are the desired and estimated relative positions. The thrust setpoint O t_sp is then calculated by

O t_sp = K_pv Δv + K_dv (dΔv/dt) + O t_int

where the velocity error Δv = O v_sp − O v, K_pv and K_dv are the P and D gains of the velocity PD controller, and O t_int is an integral term that compensates the offset between the desired position O z_sp and the initial position O z_ini in the z direction:

O t_int = K_iv ∫ Δv dt

where K_iv is the velocity integral gain. With a known thrust setpoint O t_sp and yaw angle setpoint O ψ_sp, the desired 3-axis unit vectors are

B_z = Ô t_sp,  B_x = (ψ_xy × B_z) / ||ψ_xy × B_z||,  B_y = B_z × B_x

where B_x, B_y, and B_z are the 3-axis unit vectors of the UAV's body frame, ψ_xy = (−sin ψ_sp, cos ψ_sp, 0) is the desired yaw direction in the body frame's xy plane, and Ô t_sp is the normalized vector of O t_sp. By stacking B_x, B_y, and B_z we obtain the desired rotation matrix

B R_sp = [B_x  B_y  B_z]

By transforming B R_sp to quaternion form, we obtain the attitude setpoint B q_sp. With the known attitude setpoint B q_sp and the attitude B q estimated by the EKF state estimator, the angular rate setpoints φ̇_sp, θ̇_sp, and ψ̇_sp are calculated with tuned proportional gains:

ṁ_sp = K_pm e_m,  m ∈ {φ, θ, ψ}

where K_pm are the tuned proportional gains of the roll, pitch, and yaw angles, respectively, and e_m = m_sp − m are the angle errors between the desired attitude and the estimated attitude. In the rate controller, simple PID control is used, and its outputs together with B t_sp are translated to the motor PWM signals in the mixer module. Figure 10 shows the flowchart of the relative-position-based visual servoing system.
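The cascade described above (P on position, PD plus an integral term on velocity, then desired body axes from the thrust and yaw setpoints) can be sketched as follows; the gains and the hover-thrust (gravity) compensation are illustrative assumptions, not the tuned values of the actual PX4-based controller:

```python
import numpy as np

# Illustrative gains; not the tuned values from the actual flight controller.
K_px, K_pv, K_dv, K_iv = 1.0, 2.0, 0.1, 0.05

def velocity_setpoint(x_sp, x):
    # O v_sp = K_px (O x_sp - O x): proportional control on relative position.
    return K_px * (x_sp - x)

def thrust_setpoint(v_sp, v, dv_prev, integ, dt):
    # PD control on the velocity error plus an integral term.
    dv = v_sp - v
    integ = integ + K_iv * dv * dt
    t_sp = K_pv * dv + K_dv * (dv - dv_prev) / dt + integ
    return t_sp, dv, integ

def attitude_setpoint(t_sp, psi_sp):
    # Desired body axes: B_z along the normalized thrust, B_x from the desired
    # yaw direction psi_xy, B_y completing the right-handed frame.
    Bz = t_sp / np.linalg.norm(t_sp)
    psi_xy = np.array([-np.sin(psi_sp), np.cos(psi_sp), 0.0])
    Bx = np.cross(psi_xy, Bz)
    Bx /= np.linalg.norm(Bx)
    By = np.cross(Bz, Bx)
    return np.column_stack([Bx, By, Bz])  # desired rotation matrix B R_sp

# One control step: hold [0.6, 0.1, 0.1] m relative to the target at zero yaw.
x_sp, x, v = np.array([0.6, 0.1, 0.1]), np.zeros(3), np.zeros(3)
v_sp = velocity_setpoint(x_sp, x)
t_sp, dv, integ = thrust_setpoint(v_sp, v, v_sp, np.zeros(3), dt=0.005)
# Adding a gravity term here to form the total thrust vector is an assumption.
R_sp = attitude_setpoint(t_sp + np.array([0.0, 0.0, 9.81]), psi_sp=0.0)
```

The resulting R_sp is a proper rotation matrix, which would then be converted to the quaternion setpoint B q_sp for the attitude loop.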

Method
In the experiment, we first made the quadrotor take off manually, and once the target was in view, the flight mode was automatically switched to the position hold mode. The quadrotor then flies to a setpoint relative to the target's origin and holds its position. Since the relative position between the target and the quadrotor is used for position-based visual servoing, the quadrotor keeps flying toward the setpoint while the target moves.

Position Hold
Before the experiment on dynamic target tracking, we tested the performance of direct visual servoing in a position hold experiment. The experimental configuration is shown in Figure 11. The setpoint position O x_sp was set to [0.6, 0.1, 0.1]^T [m], and the setpoint yaw angle O ψ_sp was set to 0 [deg]. Note that these were set relative to the origin on the target.
Figure 12 shows the result of an autonomous position hold flight with a target recognition rate of about 350 Hz. The experimental results show that the mean positioning errors were [30.2, 63.2, 19.0] [mm] with standard deviations of [0.64, 1.4, 0.43] [mm] in the x, y, and z directions, respectively. The mean angular tracking error was 2.3291° with a standard deviation of 0.1689° in the yaw direction.

Trajectory Simulation
For the dynamic target tracking, the trajectory of the target was set beforehand for the convenience of setting its moving speed. In the experiment, the target was moved back and forth between two postures of the robot arm at different speeds to verify the tracking results under different conditions. Figure 13 shows the positions set for the tracking tasks.

Results
The tracking trajectories were transformed to the world coordinate system (Figure 14) located on the robot arm, as shown in Figure 15, with the target moving at a linear velocity of 0.5 [m/s] and a linear acceleration of 0.1 [m/s²]. The setpoint position O x_sp was set to [0.6, 0.1, 0.1]^T [m], and the setpoint yaw angle O ψ_sp was set to 0 [deg] relative to the origin on the target.
The experimental results showed that the mean positional tracking errors were [41.4, 51.2, 34.5] [mm] with standard deviations of [1.2, 1.3, 1.0] [mm] in the x, y, and z directions, respectively. The mean angular tracking error was 0.7006° with a standard deviation of 0.0574° in the yaw direction. Since the relative setpoint was [0.6, 0.1, 0.1]^T [m] in the x, y, and z directions with respect to the object, a position offset between the UAV's trajectory and the robot end effector can be seen during tracking.
To test the performance of our direct visual servoing system while tracking a high-speed moving target, we repeated the above experiment with higher acceleration. Figure 16 shows the tracking results while the target was moving backward and forward with a linear velocity of 0.5 [m/s] and a linear acceleration of 1.0 [m/s²]. The experimental results showed that the mean positional tracking errors were [50.5, 80.1, 129.1] [mm] with standard deviations of [1.8, 2.6, 4.9] [mm] in the x, y, and z directions, respectively. The mean angular tracking error was 1.0101° with a standard deviation of 0.0019° in the yaw direction. As shown in Figure 16, we found an overshoot when the target was moved with a higher linear acceleration. This may result from the large moment of inertia of our flying platform, which was originally designed as a racing drone.
We also compared the tracking performance in the z direction and the yaw angle estimation at different target recognition rates, as shown in Table 2. The experimental results showed that higher target recognition rates not only effectively improved the tracking performance but also allowed the accuracy of the attitude estimation to be evaluated. Figure 17 shows the tracking results with a 30 Hz target recognition rate. Compared with Figure 16, a tracking delay in the z direction is obvious, due to the larger control latency resulting from the insufficient position update rate. Figure 18 shows the experimental result of dynamic target tracking within 1.7 s.

Discussion
In the experiments, the feasibility of direct visual servoing of a UAV was demonstrated by a position hold task and two target tracking tasks with different moving speeds. We used the PnP algorithm to obtain the relative pose of the target, assuming that the 3D positions of the features in the target coordinate system were known. However, in actual applications, feature points must be detected and their 3D positions estimated at the moment they enter the field of view, which is usually difficult with a monocular camera alone. Nevertheless, tracking and avoidance are often possible even if the relative position is not exactly accurate; in such cases, a rough estimate is sufficient.
The measurement errors ∆p of the vision-based position estimator can be described by the propagation of uncertainty:

∆p = Σ_{n=1}^{4} (∂p/∂u_n) ∆u_n  (16)

where ∆u_n (n ∈ {1, 2, 3, 4}) are the errors of the four selected features on the image. Expanding the partial derivatives in (16) gives (17), from which we can calculate the errors ∆p of the estimated relative position by propagating the errors ∆u_1, ∆u_2, ∆u_3, and ∆u_4. In the experiment, we obtained estimation errors ∆p of the relative position of about [57.3, 125.9, 142.7]^T [mm] in the x, y, and z directions, respectively. In addition, we found that the velocity estimates of the EKF state estimator diverged when the sample rate of the estimated relative position exceeded that of the IMU (250 Hz) under such estimation errors. Therefore, although we were able to recognize the target at about 350 Hz, its position information was sent to the flight controller at about 120 Hz to ensure a safe flight. To improve the accuracy and decrease the computational load of the state estimator, we will adopt an unscented Kalman filter (UKF) for velocity estimation in the future.
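As a first-order illustration of this uncertainty propagation, a lateral pixel error du at depth Z maps through the pinhole model to a position error of roughly (Z/f)·du; the focal length below is hypothetical:

```python
# First-order uncertainty propagation through a pinhole camera: at depth Z,
# a lateral image error du [px] maps to a position error dx ≈ (Z / f) * du.
f = 600.0   # hypothetical focal length [px]
Z = 0.6     # working distance [m], matching the position hold setpoint
du = 1.0    # one-pixel feature localization error

dx = (Z / f) * du  # -> 0.001 m, i.e., about 1 mm per pixel at 0.6 m
```

The full propagation in (16) sums such Jacobian contributions over all four feature points, which is why the reported errors grow with depth and with feature localization noise.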

Conclusions
In the presented work, we proposed a relative-position-based visual servoing system employing a high-speed target recognition method and a novel controller design. Compared to previous research on visual servoing of UAVs, our system uses visual position information to control the UAV directly at high rates, and the feasibility of the proposed control scheme was verified in a position hold flight. The advantages of the proposed method were demonstrated through real-time target tracking experiments. The experimental results showed that our proposed method achieves better tracking performance than a low-speed target recognition setting.
In future work, we will improve the performance of the proposed vision-based state estimator, including its accuracy and robustness to the environment. We will then focus on developing a purely vision-based flight controller, in which visual information is used to control the UAV directly. Furthermore, we will perform real-time feature extraction and feature tracking of arbitrary targets. By extracting features from the surrounding environment directly, we aim to realize high-speed manipulation of UAVs by direct visual servoing, enabling various tasks such as high-speed dynamic avoidance and high-speed target catching in any environment.

Figure 3. Configuration of the target object. (a) Target object. (b) Robot arm for moving the target.

Figure 4. Image processing for selecting four inliers for pose estimation.

Relative Pose Estimation: Perspective-n-Point (PnP)
Once we have tracked features p_1, p_2, ..., p_n ∈ R² on the image, with a known camera matrix and a set of 3D positions, we can derive the camera pose.

Figure 7. Control loop of relative-position-based high-speed visual servoing.

Figure 8. Coordinate systems and relative attitude. (a) Relationship between each coordinate system. (b) Relative yaw angle.

Figure 9. Design of direct vision-based position controller.

Figure 10. Flowchart of direct vision-based position controller.

Figure 14. Relationship between object coordinate system and world coordinate system.
Figure 15.

Table 1. Specifications of our platform.

Table 2. Tracking performance comparison with different target recognition rates.

Target Recognition Rate | z Mean Error | z Mean Error Std. | Yaw Mean Error | Yaw Mean Error Std.