Real-Time Visual Tracking of Moving Targets Using a Low-Cost Unmanned Aerial Vehicle with a 3-Axis Stabilized Gimbal System

Featured Application: A complete solution for visual detection and autonomous tracking of a moving target is presented, applicable to low-cost aerial vehicles in reconnaissance, surveillance, and target acquisition (RSTA) tasks.

Abstract: Unmanned Aerial Vehicles (UAVs) have recently shown great performance collecting visual data through autonomous exploration and mapping, which are widely used in reconnaissance, surveillance, and target acquisition (RSTA) applications. In this paper, we present an onboard vision-based system that enables low-cost UAVs to autonomously track a moving target. Real-time visual tracking is achieved by using an object detection algorithm based on the Kernelized Correlation Filter (KCF) tracker. A 3-axis gimbaled camera with a separate Inertial Measurement Unit (IMU) is used to aim at the selected target during flights. The flight control algorithm for tracking tasks is implemented on a customized quadrotor equipped with an onboard computer and a microcontroller. The proposed system is experimentally validated by successfully chasing ground and aerial targets in an outdoor environment, demonstrating its reliability and efficiency.


Introduction
The past decade has witnessed explosive growth in the utilization of unmanned aerial vehicles (UAVs), attracting more and more attention from research institutions around the world [1,2]. With a series of significant advances in technology domains such as micro-electro-mechanical systems (MEMS), many UAV platforms and mission-oriented sensors once limited to military affairs are now widely applied in industrial and commercial sectors [3][4][5][6][7]. UAV-based target tracking is one of the most challenging tasks, closely related to applications such as traffic monitoring, reconnaissance, surveillance, and target acquisition (RSTA), search and rescue (SAR), inspection of power cables, etc. [8][9][10][11][12][13]. The tracking system is an important part of a UAV: it detects a target of interest rapidly in a large area and then performs continuous surveillance of the selected target in the tracking phase [14]. To achieve target acquisition and localization, military UAVs are usually equipped with airborne radars [15] or guided seekers [16]; however, these are too heavy and expensive for small, low-cost platforms.

Problem Formulation and System Architecture
Although many algorithms are capable of tracking targets in video streams, techniques reported in the field of computer vision cannot be easily extended to airborne applications because of the highly dynamic UAV-target relative motion. In this section, a small commercial drone (41 cm in diameter) is considered as the target, with unknown speed. If such an unwanted aerial visitor flies into places such as airports, prisons, or military bases where consumer drones are not allowed, it can pose a serious problem. Therefore, many anti-UAV defense systems are being introduced to combat the growing threat of malicious UAVs. One option is a rifle-like device that sends a high-power electromagnetic wave to jam the UAV control system and force it to land immediately. Another option is to capture the target in mid-air using a UAV platform that carries a net gun.
Consider Figure 1, which depicts the UAV-target relative kinematics and defines the coordinate systems. Let B denote the body frame that moves with the UAV, C the camera frame that is attached to the UAV but rotates with respect to frame B, N a North-East-Down (NED) coordinate system taken as an inertial reference frame, and I the image frame. The origin of the camera frame C is the optical center, and Z_c coincides with the optical axis of the camera. Following the notation introduced in [52], let p^C = [x_c, y_c, z_c]^T denote the position of the target in frame C. The rotation transformation R_c^n from frame C to frame N is R_c^n = R_b^n R_c^b, where R_b^n is calculated from the roll, pitch, and yaw angles of the UAV given by the flight controller, and R_c^b can be computed by the gimbal system using the relative angular positions measured by the encoders. Let p be the position of the target with respect to the optical center of the camera resolved in N, which is given as p = R_c^n p^C.

Thus, the target position p_T in the NED coordinate system can be estimated as p_T = p_B + R_b^n p_BC + p,

where p_BC is the position of the gimbaled camera relative to the UAV body in frame B, and p_B is the UAV position in N. To deal with the task of moving-target tracking, an autonomous quadrotor UAV equipped with a 3-axis gimbaled camera is constructed to detect and follow the flying object in a pursuit-evasion scenario. The visual tracking system processes the images and drives the gimbal to search the target areas. Once an intruding drone is detected, its location in each image frame is acquired and utilized by the automated targeting module for aiming control of the gimbal. While the target is locked, the camera pose can be used as input information to control the UAV flight. Figure 2 presents the proposed vision-based system.
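The target-position estimate above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation; the frame offsets and angles are made-up values. It composes R_b^n and R_c^b to map a camera-frame measurement into NED, following p_T = p_B + R_b^n p_BC + R_b^n R_c^b p^C:

```python
import numpy as np

def rot_z(psi):
    """Rotation about the z-axis (yaw); stands in for a full R_b^n here."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def target_position_ned(p_B, R_nb, p_BC, R_bc, p_C):
    """p_T = p_B + R_b^n p_BC + R_c^n p^C, with R_c^n = R_b^n R_c^b."""
    R_nc = R_nb @ R_bc                 # rotation from camera frame C to N
    return p_B + R_nb @ p_BC + R_nc @ p_C

# Example: UAV hovering at 10 m altitude (NED z points down), level flight.
p_B = np.array([0.0, 0.0, -10.0])      # UAV position in N
p_BC = np.array([0.1, 0.0, 0.05])      # gimbal offset in B (hypothetical)
R_nb = rot_z(0.0)                      # zero roll/pitch/yaw
R_bc = np.eye(3)                       # gimbal at neutral angles
p_C = np.array([0.0, 0.0, 20.0])       # target 20 m along the optical axis
p_T = target_position_ned(p_B, R_nb, p_BC, R_bc, p_C)
```

With identity rotations the three terms simply add, which makes the composition easy to sanity-check before real attitude and encoder data are wired in.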

Kinematics of the Gimbal System
While the FOV of a single camera is limited, gimbal systems can rotate the camera to a desired direction, and they are widely applied in fields such as filming and monitoring. When these systems are mounted onboard a UAV, the torque motors are driven using the IMUs and other angular sensors to compensate for the rotations resulting from the UAV flight, returning the stabilized member to its original attitude. As shown in Figures 3 and 4, the gimbal system used in this paper consists of direct-current (DC) motors that balance the platform, magnetic rotary encoders that sense the relative rotation, an embedded stabilization controller that processes all the sensor information and outputs the control signals, a vibration damper that connects the outer gimbal to the UAV body, and the camera that captures the images. For most civil UAV applications, the gimbal system and the UAV flight control system are independent of each other. However, to achieve autonomous tracking of a moving target, the two systems are coordinated by the proposed vision algorithm, which is implemented on an onboard computer running a Linux-based system.

The 3-axis gimbaled camera supporting structure consists of the case, outer frame, middle frame, and inner frame, as depicted in Figure 5. The kinematic relations are set as a yaw-roll-pitch sequence, and four reference frames are introduced: the body-fixed frame F, the outer frame O, the middle frame M, and the inner frame G, connected by three revolute joints. Considering the common structure of gimbals in [53][54][55], the relative angles are defined as yaw (θ_Y), roll (θ_R), and pitch (θ_P). Frame F is carried into frame O by a rotation θ_Y around the axis z_F. Frame O is carried into frame M by a rotation θ_R around the axis x_O. Finally, frame M is carried into frame G by a rotation θ_P around the axis y_M. The coordinate systems of the gimbal (Figure 5) are placed parallel to each other in the initial configuration (θ_Y, θ_R, θ_P) = (0, 0, 0). The coordinate of an arbitrary point P in frame G, denoted by the vector ^G r_P, can be described in a different coordinate frame F using the rotation matrix ^F R_G and the translation vector ^F d_G between the frames: ^F r_P = ^F R_G ^G r_P + ^F d_G. A more convenient way to describe such a transformation is the homogeneous transformation matrix ^F T_G, which stacks ^F R_G and ^F d_G into a single 4 × 4 matrix. Several intermediate transformations are required to obtain the final transformation in Equation (4): ^F T_G = ^F T_O ^O T_M ^M T_G. With the parameters l_1, l_2, h_1, h_3, and b_2 in Figure 5, the transformations between the frames are as follows.
The transformation between frame F and frame O is ^F T_O(θ_Y); the transformation between frame O and frame M is ^O T_M(θ_R); and the transformation between frame M and frame G is ^M T_G(θ_P). Thus, the total rotation matrix ^F R_G and translation vector ^F d_G between frame F and frame G follow from the product ^F T_G = ^F T_O ^O T_M ^M T_G. The matrix ^F R_G is called a pitch-roll-yaw rotation matrix according to the order in which the rotation matrices are successively multiplied. In a similar way, the rotation between frame G and the inertial reference frame N can be described by ^N R_G(α_Y, α_R, α_P), where α_Y, α_R, and α_P are derived using the information from the gyros and accelerometers of the IMU attached to the camera. The ^N R_G can also be derived using the rotation matrices ^F R_G and ^N R_F as ^N R_G = ^N R_F ^F R_G. The angular velocity of frame F with respect to frame N expressed in frame F, ^F ω_NF, is measured and available. The angular velocities of frames O, M, and G with respect to frame N, expressed in their own frames, follow by transforming ^F ω_NF through the joint rotations and adding the corresponding relative joint rates. The inertia matrices of the outer, middle, and inner gimbals are J_O, J_M, and J_G, where J_kx, J_ky, J_kz (k = O, M, G) refer to the diagonal elements. For simplicity, it is assumed that the off-diagonal elements of the inertia matrices can be neglected and only the moments of inertia are considered. The angular momentum of each gimbal is the product of its inertia matrix and its inertial angular velocity; for example, for the pitch gimbal, ^G H = J_G ^G ω_NG. Each member of the gimbal system is treated as a rigid body, and the moment equation equates the external torque to the rate of change of angular momentum, where the external torques τ_O, τ_M, and τ_G, about z_F, x_O, and y_M, respectively, are applied to the gimbals by the motors and other external disturbance torques.
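The yaw-roll-pitch transformation chain can be sketched with homogeneous matrices. This is an illustrative reconstruction, not the authors' code: the link offsets (here h1, l1, b2, reusing parameter names from Figure 5) are placeholder values, and their placement along the axes is assumed:

```python
import numpy as np

def rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def homog(R, d):
    """Stack a rotation and a translation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = d
    return T

def gimbal_transform(th_y, th_r, th_p, h1=0.05, l1=0.03, b2=0.02):
    """^F T_G = ^F T_O @ ^O T_M @ ^M T_G for the yaw-roll-pitch chain.
    Offsets h1, l1, b2 are placeholders, not measured gimbal geometry."""
    T_FO = homog(rz(th_y), [0.0, 0.0, h1])   # yaw about z_F
    T_OM = homog(rx(th_r), [l1, 0.0, 0.0])   # roll about x_O
    T_MG = homog(ry(th_p), [0.0, b2, 0.0])   # pitch about y_M
    return T_FO @ T_OM @ T_MG

T = gimbal_transform(0.0, 0.0, 0.0)
```

At the neutral configuration the rotation block reduces to the identity and the translation is simply the sum of the link offsets, which provides a quick consistency check of the chain.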

Stabilization and Aiming
The camera fitted on the innermost frame is inertially stabilized and controlled by the gimbal system. Furthermore, the system is required to align its optical axis in elevation and azimuth with a LOS joining the camera and target. Figure 6 describes the angular geometry of how the gimbaled camera aims at the target, where θ is the pitch angle of the UAV body, δ is the boresight angle, λ is the LOS angle, θ P is the pitch angle of the gimbal frame, and ε is the boresight error angle.

There are four operation modes, namely, the preset angle mode, search mode, stabilize mode, and tracking mode, as shown in Figure 7. When the gimbal system is powered on and initialized, its direction is kept at δ = 0 in inertial space. In preset angle mode, the optical axis of the camera is set to a given angle, and the control system maintains the desired direction despite disturbances. The system may then switch to search mode, in which the gimbal rotates cyclically between its minimum and maximum angles to search a larger area.
When a target is confirmed, the control system switches to tracking mode and keeps the target in the center of the camera view. If the target is lost and cannot be recaptured within a few seconds, the system returns to search mode and tries to find it again. Figure 8 shows how the control system works; it contains two loops: a tracking loop and a stabilizing loop. Based on the image information and the measurement data received from the angular sensors, the tracking loop generates a rate command to direct the boresight towards the target LOS so that the pointing error is kept near zero. Meanwhile, the stabilizing loop isolates the camera from UAV motion and external disturbances, which would otherwise perturb the aim-point. The control loops in the roll, elevation, and azimuth channels are related by a cross-coupling unit based on the gimbal system dynamics, where cross coupling is defined as the effect of the rotation of one axis on another [56].
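The mode-switching logic described above can be sketched as a small state machine. This is an illustrative reconstruction under stated assumptions: the mode names follow Figure 7, but the update interface and the loss-timeout value are made up:

```python
from enum import Enum, auto

class Mode(Enum):
    PRESET = auto()
    SEARCH = auto()
    STABILIZE = auto()
    TRACKING = auto()

class GimbalModeController:
    """Switches gimbal modes in the spirit of Figure 7: sweep in search
    mode until a target is confirmed, track it, and fall back to search
    after the target has been lost for longer than a timeout."""

    def __init__(self, loss_timeout=2.0):
        self.mode = Mode.PRESET        # powered on: hold a preset direction
        self.loss_timeout = loss_timeout
        self.lost_for = 0.0

    def update(self, dt, target_confirmed):
        if self.mode in (Mode.PRESET, Mode.SEARCH):
            if target_confirmed:
                self.mode = Mode.TRACKING
                self.lost_for = 0.0
            elif self.mode is Mode.PRESET:
                self.mode = Mode.SEARCH      # begin sweeping for targets
        elif self.mode is Mode.TRACKING:
            if target_confirmed:
                self.lost_for = 0.0
            else:
                self.lost_for += dt
                if self.lost_for > self.loss_timeout:
                    self.mode = Mode.SEARCH  # give up and search again
        return self.mode
```

Keeping the transitions in one place makes it straightforward to add hysteresis or extra conditions (e.g. a minimum detection confidence) without touching the control loops themselves.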

KCF Tracker
The KCF tracker [42,57] used in this paper treats sample training as a Ridge Regression problem, a regularized minimization problem with a closed-form solution.
Consider an n × 1 vector x = [x_1, x_2, ..., x_n]^T as the base sample, which represents a patch containing the target of interest. A small translation of this vector is given as Px, where P is the cyclic permutation matrix that shifts the elements of x by one position, and u shifts can be made to achieve a larger translation by using the matrix power, P^u x. Through cyclic shifting operations, these vectors constitute a circulant matrix X = C(x). Usefully, all circulant matrices are diagonalized by the Discrete Fourier Transform (DFT), regardless of the base sample x, which can be expressed as X = F diag(x̂) F^H, where x̂ is the DFT of the base sample, x̂ = F(x), and F is a constant matrix known as the DFT matrix that does not depend on x.
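The diagonalization property is easy to check numerically with numpy: multiplying a vector by the circulant matrix built from the cyclic shifts of x reduces to element-wise products of DFTs. A small sketch (whether the conjugate lands on x̂ depends on the chosen shift direction; the form below matches rows built with np.roll(x, i)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)     # base sample (a 1-D "patch")
v = rng.standard_normal(n)     # arbitrary vector to multiply by

# Circulant data matrix: row i is the base sample cyclically shifted i times.
X = np.stack([np.roll(x, i) for i in range(n)])

# Dense O(n^2) product vs. the O(n log n) DFT route:
# X v = IDFT( conj(DFT(x)) * DFT(v) ) for this shift convention.
dense = X @ v
fast = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(v)).real
```

The same identity is what lets the tracker evaluate all cyclic shifts at once instead of forming X explicitly.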
Based on Ridge Regression, the goal of training is to find a function f(z) that minimizes the squared error over the samples x_i and their regression targets y_i: min_w Σ_i (f(x_i) − y_i)² + λ‖w‖², where the regularization parameter λ controls overfitting, as in Support Vector Machines (SVM) [57], and w represents the filter coefficients. For a linear regression function f(z) = w^T z, the minimizer has the closed-form solution w = (X^T X + λI)^{−1} X^T y, where X is the circulant matrix with one sample x_i per row, each element of y is a regression target y_i, and I is the identity matrix. By utilizing the diagonalization of circulant matrices, Equation (38) can be expressed in the Fourier domain as ŵ = (x̂* ⊙ ŷ)/(x̂* ⊙ x̂ + λ), where ŵ, x̂, ŷ are the DFTs of w, x, y, respectively, and x̂* is the complex conjugate. In addition, w can easily be recovered in the spatial domain with the Inverse Discrete Fourier Transform (IDFT).
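The closed-form and Fourier-domain solutions can be compared numerically. A sketch, with one caveat: where the conjugate appears depends on the shift and DFT conventions, and for a data matrix whose row i is np.roll(x, i) the unconjugated x̂ is the one that matches the direct solve:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 16, 0.1
x = rng.standard_normal(n)                        # base sample
y = rng.standard_normal(n)                        # regression targets
X = np.stack([np.roll(x, i) for i in range(n)])   # one cyclic shift per row

# Direct Ridge Regression solution: w = (X^T X + lam I)^-1 X^T y, O(n^3).
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier-domain solution: element-wise division, O(n log n) overall.
xf, yf = np.fft.fft(x), np.fft.fft(y)
w_fft = np.fft.ifft(xf * yf / (np.abs(xf) ** 2 + lam)).real
```

The two vectors agree to machine precision, which is the whole point of the circulant trick: the n × n solve collapses to one division per frequency bin.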
When the regression function f(z) is nonlinear, the kernel trick is used to map the inputs of the linear problem to a nonlinear, high-dimensional feature space φ(x), with the solution expressed as w = Σ_i α_i φ(x_i). The variables under optimization are then α instead of w. The kernel function k allows the algorithm to be computed in terms of dot-products, k(x, x') = φ(x)^T φ(x'), where all the dot-products between samples are stored in an n × n kernel matrix K with elements K_ij = k(x_i, x_j), and the regression function can be expressed as f(z) = Σ_i α_i k(z, x_i). The solution of this regression problem is α = (K + λI)^{−1} y, where K is the kernel matrix and α is the vector of coefficients α_i, which express the solution in the dual space. By making K circulant, Equation (44) can be diagonalized as in the linear case, obtaining α̂ = ŷ/(k̂^{xx} + λ), where k̂^{xx} is the correlation kernel of x with itself in the Fourier domain and α̂, ŷ are the DFTs of the vectors α, y. For the kernel matrix K^z between all training samples (cyclic shifts of x) and candidate image patches (cyclic shifts of the base patch z), each element of K^z is given by k(P^{i−1} z, P^{j−1} x). From Equation (43), the regression function can be computed for all candidate patches as f(z) = (K^z)^T α, where f(z) is a vector containing the output for every cyclic shift of z, i.e., the full detection response. The position where the output response takes its maximum value is the position of the target in the new frame. To compute Equation (46) efficiently, it is diagonalized as f̂(z) = k̂^{xz} ⊙ α̂, where a hat ∧ denotes the DFT of a vector. In this paper, given the nonlinear Gaussian kernel k(x, x') = exp(−‖x − x'‖²/σ²), we get k^{xx'} = exp(−(‖x‖² + ‖x'‖² − 2 F^{−1}(x̂ ⊙ x̂'*))/σ²), where the kernel correlation can be computed using a few DFT/IDFT and element-wise operations in O(n log n) time. Henriques et al. [42] proved that converting the matrix inversion in the spatial domain into element-wise operations in the Fourier domain greatly reduces the computational complexity and shortens the computation time.
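The fast Gaussian kernel correlation can be checked against brute-force evaluation over all cyclic shifts. A sketch with a toy 1-D patch (the shift direction matches np.roll with a positive offset):

```python
import numpy as np

def gauss_kernel_correlation(x, xp, sigma):
    """k_i = exp(-||x - shift_i(xp)||^2 / sigma^2) for every cyclic shift,
    computed with a single FFT-based cross-correlation."""
    cross = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(xp))).real
    return np.exp(-(np.dot(x, x) + np.dot(xp, xp) - 2.0 * cross) / sigma**2)

rng = np.random.default_rng(2)
n, sigma = 8, 1.5
x, xp = rng.standard_normal(n), rng.standard_normal(n)

fast = gauss_kernel_correlation(x, xp, sigma)
brute = np.array([np.exp(-np.sum((x - np.roll(xp, i)) ** 2) / sigma**2)
                  for i in range(n)])
```

The fast form evaluates all n kernel values with one FFT pair instead of n squared-distance computations, which is what keeps the per-frame cost of the tracker low.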
In the tracking process, considering target variations such as illumination, scale, occlusion, and deformation, the target appearance model and coefficient vector are updated after each frame [58] by linear interpolation: x_t = (1 − η_t) x_{t−1} + η_t x and α_t = (1 − η_t) α_{t−1} + η_t α, where x_{t−1}, x_t are the target models updated after frames t − 1 and t, α_{t−1}, α_t are the coefficient vectors updated after frames t − 1 and t, and η_t is the learning rate.
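The update itself is a simple linear interpolation; a sketch (the learning-rate value is illustrative, not the one used in the paper):

```python
import numpy as np

def update_model(x_prev, alpha_prev, x_new, alpha_new, eta=0.02):
    """Blend the previous appearance model and coefficient vector with
    the current frame's estimates; eta trades adaptation speed for drift."""
    x_t = (1.0 - eta) * x_prev + eta * x_new
    alpha_t = (1.0 - eta) * alpha_prev + eta * alpha_new
    return x_t, alpha_t
```

A small η keeps the model stable under brief occlusions; a larger η adapts faster to genuine appearance changes but drifts more easily.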

Target Localization
Consider a monocular camera model as shown in Figure 9.

When an arbitrary point p(x_w, y_w, z_w) in the world frame W is detected by the camera, its 2D position p_i(x_i, y_i) on the image plane is given by the pinhole projection: u − u_0 = f x_c/(z_c dx) and v − v_0 = f y_c/(z_c dy), where (x_c, y_c, z_c) is the position of p in the camera frame, f is the focal length, (u, v) is the target position in pixel values, (u_0, v_0) is the intersection of the optical axis and the image plane, and dx and dy are the physical lengths per pixel in the x_i and y_i axis directions. Equations (51) and (52) can be integrated as [u, v, 1]^T = M_in [x_1c, y_1c, 1]^T, where [x_1c, y_1c, 1]^T = [x_c/z_c, y_c/z_c, 1]^T is the point on the normalized plane and M_in is the intrinsic parameter matrix of the camera:
M_in =
[ f/dx   0      u_0 ]
[ 0      f/dy   v_0 ]
[ 0      0      1   ]
With a calibrated camera, the pointing error angle ε in elevation and azimuth can be computed using the above relations: ε_χ = arctan((u − u_0) dx/f) and ε_γ = arctan((v − v_0) dy/f), where ε_χ and ε_γ are the pointing (boresight) errors in azimuth and elevation, respectively, which are fed to the target aiming system as input information to control the gimbal.
A scalable KCF tracker is used in this paper, with the scale changes of the target taken into consideration. The tracker not only updates the centroid position of the target in each image frame but also outputs the target size in pixel values. This can be used to control the distance between the interceptor and the target while tracking, even though the physical size of the target is unknown. The results of tracking a pedestrian with the proposed tracker are shown in Figure 10; the bounding box adapts to the target variations in the video stream.
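The boresight-error computation reduces to a few lines once the intrinsics are known. A sketch, where the arctangent form follows the pinhole geometry above and the default intrinsic values are hypothetical, not the calibrated parameters of the actual camera:

```python
import numpy as np

def boresight_errors(u, v, f=4e-3, dx=2e-6, dy=2e-6, u0=320.0, v0=240.0):
    """Pointing (boresight) errors in radians from the pixel offset of
    the target relative to the principal point (u0, v0)."""
    eps_az = np.arctan((u - u0) * dx / f)   # epsilon_chi (azimuth)
    eps_el = np.arctan((v - v0) * dy / f)   # epsilon_gamma (elevation)
    return eps_az, eps_el

# A target detected exactly at the principal point gives zero error.
az, el = boresight_errors(320.0, 240.0)
```

These two angles are exactly what the aiming loop consumes as its error signal, so the whole image-to-gimbal interface is two arctangents per frame.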

UAV Dynamic Model
The 6 DOF motion of a rigid quadrotor is described in Figure 11.

Let m denote the mass of the quadrotor and J its moment of inertia. The external forces and torques acting on the quadrotor platform are primarily produced by the propellers and gravity. A local NED frame N and a body-fixed frame B are introduced to describe the motion of the quadrotor. p^n = [p_x^n, p_y^n, p_z^n]^T and v^n = [v_x^n, v_y^n, v_z^n]^T are the position and linear velocity of the quadrotor's mass center relative to N. Θ = [φ, θ, ψ]^T contains the roll, pitch, and yaw angles, which represent the orientation of the quadrotor in N. The rotation matrix R_b^n from B to N is expressed as:
R_b^n =
[ cosθcosψ   sinφsinθcosψ − cosφsinψ   cosφsinθcosψ + sinφsinψ ]
[ cosθsinψ   sinφsinθsinψ + cosφcosψ   cosφsinθsinψ − sinφcosψ ]
[ −sinθ      sinφcosθ                  cosφcosθ                ]
The equations of motion can then be written in terms of f_b, the force applied to the quadrotor given in B; τ, the torque; g, the gravitational acceleration; and ω_b, the angular velocity of the quadrotor in B. The gyroscopic moment G_a is mainly produced by the propellers and can be neglected. In addition, the translational dynamics shown in Equations (58) and (59) can be simplified accordingly.
Furthermore, it can be assumed that sin φ ≈ φ, cos φ ≈ 1, sin θ ≈ θ, and cos θ ≈ 1 for small angle approximation, which leads to a simplified dynamic model as described in [59].
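The rotation matrix and its small-angle behaviour are easy to verify numerically. A sketch that transcribes the matrix above and checks orthonormality and the linearization:

```python
import numpy as np

def R_nb(phi, theta, psi):
    """Rotation from body frame B to NED frame N for roll/pitch/yaw
    angles (Z-Y-X Euler sequence), matching the matrix given above."""
    cf, sf = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    return np.array([
        [ct * cp, sf * st * cp - cf * sp, cf * st * cp + sf * sp],
        [ct * sp, sf * st * sp + cf * cp, cf * st * sp - sf * cp],
        [-st,     sf * ct,                cf * ct],
    ])

R = R_nb(0.05, -0.03, 0.4)   # small roll/pitch, moderate yaw (radians)
```

A proper rotation matrix must satisfy R Rᵀ = I and det R = 1, and for small roll and pitch the (3,1) entry reduces to −θ, which is the basis of the simplified model cited above.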

Tracking Strategy
The tracking strategy used in this paper is based on proportional navigation (PN), a well-known guidance law widely used to enable a missile to catch its target in optimal time. The constant-bearing approach considers that the missile will eventually collide with the detected target if the LOS angle is kept constant. The PN method improves on the constant-bearing approach to accommodate target maneuvers by accelerating the missile in a direction lateral to the LOS, with magnitude proportional to the rate of change of the LOS angle. There are different types of PN methods according to their mathematical formulations, and their performance when applied to the guidance of a quadrotor has been analyzed in [60].
The desired acceleration ü_des obtained by the PN method can be expressed as ü_des = N λ̇ L, where λ̇ is the rate of change of the LOS angle, N is the navigation gain, and L is the normal direction of the acceleration command, which is calculated in different ways: L_RTPN, L_IPN, L_PPN, and L_NGL stand for Realistic True Proportional Navigation (RTPN) [61], Ideal Proportional Navigation (IPN) [62], Pure Proportional Navigation (PPN) [63], and the Nonlinear Guidance Law (NGL) [64], respectively. Here u, u̇, ü ∈ R² are the position, velocity, and acceleration of the interceptor, respectively, |·| represents the magnitude of a vector, and β is the angle between the interception velocity and the LOS of the target.
In addition, λ̇ can be described in terms of the relative position and velocity of the target with respect to the interceptor. The aim of the research in this paper is to track a moving target with a quadrotor platform, with coordinated control of the UAV and the gimbaled camera. As shown in Figure 12a,b, this is done in two directions: the longitudinal direction and the lateral direction [65].
The λ_χ, λ_γ are the lateral and longitudinal LOS angles of the target, respectively. The ε_χ, ε_γ are the boresight error angles in the lateral and longitudinal directions, respectively, which are controlled to zero (ε_χ, ε_γ → 0). When tracking is initiated, the UAV follows the target by tracking its lateral LOS angle with a forward speed given by the horizontal component of the approaching velocity u̇_App. To keep following a target moving at unknown speed, the approaching velocity u̇_App is decided by the scale changes of the target in the image frame. Then, a PPN guidance law is activated, and the acceleration command ü_des^PPN is applied to the UAV according to the rate of change of the LOS angle, as shown in Figure 12b.
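A single 2-D PPN step built from these quantities might look as follows. This is an illustrative sketch, not the paper's controller: the LOS-rate expression uses the standard planar relative-motion form, and the navigation gain is an arbitrary textbook value:

```python
import numpy as np

def ppn_accel(p_i, v_i, p_t, v_t, N=3.0):
    """Pure Proportional Navigation in the plane: command an acceleration
    normal to the interceptor velocity, proportional to the LOS rate."""
    r = p_t - p_i                                            # LOS vector
    vr = v_t - v_i                                           # relative velocity
    lam_dot = (r[0] * vr[1] - r[1] * vr[0]) / np.dot(r, r)   # LOS angle rate
    perp = np.array([-v_i[1], v_i[0]])                       # velocity rotated +90 deg
    return N * lam_dot * perp

# Head-on, non-maneuvering target on a collision course: the LOS does not
# rotate, so no lateral acceleration is commanded.
a = ppn_accel(np.array([0.0, 0.0]), np.array([10.0, 0.0]),
              np.array([100.0, 0.0]), np.array([-5.0, 0.0]))
```

When the target drifts sideways the LOS rate becomes nonzero and the command turns the interceptor velocity toward the predicted intercept point, which is the qualitative behaviour PN is chosen for.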

Flight Control System
The quadrotor is a typical underactuated system with only four independent inputs less than the degrees of freedom, so only the desired position and desired yaw angle can be directly tracked. The desired roll angle and the desired pitch angle are determined by the known ones. The flight control system of a quadrotor is described in Figure 13.

The pixel values (u_T, v_T) are the centroid of the target position in each image frame, and S_T is the scale change of the target, both given by the proposed KCF tracker. The state of the quadrotor (Θ, ω), expressed in the global NED frame, is provided by the autopilot using an Extended Kalman Filter (EKF).
The desired position [x_d, y_d, z_d]^T and the desired yaw angle ψ_d are given by the gimbal system in the global NED frame using the tracking strategy described above. A cascade proportional-integral-derivative (PID) controller is designed to individually control the 6-DOF motion of the quadrotor. The attitude control loop is implemented on the microcontroller, while the outer loop for position control is implemented on the onboard computer. All PID gains have been preliminarily tuned in hovering flight tests. The outputs of the cascade PID controller are the desired force f_d and the desired torque τ_d, which are applied to the UAV body. The mixer gives the desired angular velocity of each motor to the electronic speed controller (ESC), which is expressed as a Pulse Width Modulation (PWM) signal.
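The cascade structure can be sketched as two PID loops in series; the gains and error values below are placeholders, not the tuned values from the hover tests.

```python
class PID:
    """Minimal PID with derivative on error (illustrative gains only)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, None

    def update(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Cascade: the outer position loop (onboard computer, 30 Hz) produces a
# velocity setpoint; the inner attitude loop (microcontroller) runs faster.
pos_loop = PID(kp=1.2, ki=0.0, kd=0.3)   # placeholder gains
att_loop = PID(kp=6.0, ki=0.1, kd=0.8)

dt = 1.0 / 30.0
v_setpoint = pos_loop.update(error=2.0, dt=dt)    # 2 m position error
torque_cmd = att_loop.update(error=0.05, dt=dt)   # 0.05 rad attitude error
```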
It is worth mentioning that the thrust value is determined not only by the desired position but also by the takeoff weight of the quadrotor platform. Thus, the height control can be considered as two parts: a slowly varying base value for hovering and a fast controller for position tracking.
When the target is selected, the quadrotor will keep the distance relative to the target based on the estimation of its scale changes in the image frame, which is shown in Supplementary Materials.
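The idea of regulating the approach speed from the target's apparent scale can be sketched as follows; the function, reference scale, gain, and speed limit are hypothetical values for illustration, not the system's actual parameters.

```python
def approach_speed(scale, scale_ref=1.0, k=2.0, v_max=5.0):
    """Map the target's scale change in the image to an approach speed.

    scale < scale_ref means the target appears smaller (farther away),
    so the UAV speeds up; the gain k and limit v_max are illustrative.
    """
    v = k * (scale_ref - scale) / scale_ref
    return max(-v_max, min(v_max, v))
```

At the reference scale the commanded speed is zero, so the UAV holds its relative distance to the target.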

Experimental Setup
Most experimental UAVs are equipped with expensive sensors and devices, such as high-precision IMUs, 3D light detection and ranging (Lidar) sensors, and differential global positioning systems (DGPS), which certainly improve the control accuracy but are unaffordable in many practical applications. To test the proposed tracking system, a customized quadrotor platform (65 cm in diameter) is used to perform all the flight experiments; it weighs 4.2 kg including all the payloads, as shown in Figure 14. The cost is much lower than that of comparable platforms (e.g., DJI Matrice 200). The 3-axis gimbal, with dimensions of 108 × 86.2 × 137.3 mm, weighs only 409 g, as shown in Figure 15a. The IMUs and encoders are consumer sensors at very low prices. The camera, with focal lengths ranging from 4.9 to 49 mm, has a maximum resolution of 1920 × 1080 at 60 frames per second (fps). The cost of the gimbaled camera is less than $400, which makes it very attractive considering its performance. The AS5048A magnetic rotary encoder used in the gimbal system measures the angular position of each axis, with a 14-bit high-resolution output (0.0219 deg/LSB).
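The quoted encoder resolution follows directly from the 14-bit output:

```python
counts = 2 ** 14                 # 14-bit output: 16384 positions per revolution
resolution = 360.0 / counts      # degrees per least significant bit
print(round(resolution, 4))      # 0.022, matching the quoted ~0.0219 deg/LSB
```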
As shown in Figure 15b, the quadrotor is equipped with an embedded microcontroller developed by the Pixhawk team at ETH Zürich [66]. The selected firmware version is 1.9.2.
To achieve real-time image processing, an NVIDIA Jetson TX2 module is used as an onboard computer to implement the tracking algorithm; it is among the fastest and most power-efficient embedded AI computing devices. The output data of the vision system are transferred from the onboard computer to the Pixhawk flight controller over serial communication, based on a MAVLINK [67] extendable communication node for the Robot Operating System (ROS) [68]. The control rate is 30 Hz, limited by the onboard processing speed. Using a 2.4 GHz remote controller (RC), the tracking process can be initiated by switching to the offboard mode when a target is selected.

Experimental Results and Analysis
To evaluate the performance of the proposed tracking and targeting system, we test it in different situations. After a target is selected in the video stream, the gimbal system is activated and rotates the camera to point at the selected target, which can be regarded as a step response. The boresight errors in pixels are plotted in azimuth and elevation, respectively, and are also printed at the top left of the screen.
As shown in Figure 16, the system responds rapidly and the steady-state error is about ±3 pixels, while the initial errors are hundreds of pixels. This test is typically used to tune all the control parameters of the gimbal system and can be completed on the ground. Then, the gimbal system and the onboard computer are mounted on the experimental quadrotor platform for further tests. To aim at a target from the UAV, the motion and vibration of the platform must be isolated. Once the target is locked, the UAV is in a fully autonomous mode controlled by the integrated system. Figure 17 shows the boresight error while the UAV is following a pedestrian moving at 0.9-2 m/s. Over about 250 s of flight, an accuracy of ±9.34 pixels in azimuth and ±5.07 pixels in elevation was achieved. While tracking a ground-moving pedestrian, the altitude change of the target can be ignored in most situations, which makes the task less difficult. However, tracking a flying drone is much more complicated: a drone can change its position and velocity in a very short time, which may cause tracking errors or failures. During the flight experiment, autonomous tracking of an intruding drone was achieved. Figure 18 shows the boresight error while the UAV is tracking a flying drone. Some oscillations remain in the current configuration, which occurred when the flight path of the target suddenly changed.
It is a great challenge for the system to catch up with the target in such a short time. The deviations caused by the image transmission delay, about 220 ms in the current system, cannot be ignored. Other factors such as image noise and illumination changes may also affect the tracking accuracy to some extent. The root mean square errors (RMSEs) of the boresight errors in the drone tracking experiment are listed in Table 1, compared with the other two tests.
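The RMSE figures in Table 1 are the usual root mean square of the per-frame pixel errors; a minimal computation, with made-up sample errors rather than the recorded flight data, is:

```python
import math

def rmse(errors):
    """Root mean square of a sequence of boresight errors (pixels)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Example with illustrative per-frame azimuth errors in pixels
print(rmse([3.0, -4.0, 0.0, 5.0]))   # sqrt(50/4) ~ 3.54
```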

Table 1. RMSEs (pixels) of the boresight errors in azimuth, elevation, and 2D for the three tests (step response, pedestrian tracking, and drone tracking).

Figure 19 shows the process of a successful drone tracking experiment. During the flight, the roll/pitch/yaw angles of the UAV, camera, and gimbal frames are plotted in Figure 20a-c, respectively. The actual approaching speed of the UAV tracked the setpoint changes accurately, as shown in Figure 20d. The trajectories of the intruding drone and the interceptor are plotted in a local NED frame in Figure 21.
A higher control rate and a smaller image transmission delay would significantly improve the response speed and accuracy if a better hardware configuration were used. However, considering that all the sensors and onboard devices are low-cost, the performance of the proposed tracking system is very attractive in practical applications. A video of the experiments is available in the Supplementary Materials section (Video S1).

Figure 21. The actual position (a-c) and 3D trajectory (d) of the UAV and the target drone in the local NED frame.

Conclusions
In the presented work, we proposed an onboard visual tracking system consisting of a gimbaled camera, an onboard computer for image processing, and a microcontroller that controls the UAV to approach the moving target. Our system uses a KCF-based algorithm to detect and track an arbitrary object in real time, whose efficiency and reliability have been proven in experiments. With the visual information, the 3-axis gimbal system autonomously aims at the selected target and achieved good performance during real flights. The proposed system has been demonstrated through real-time target tracking experiments, which enabled a low-cost quadrotor to chase a flying drone, as shown in the video.
Future work may include attaching a laser ranging module to the camera to provide an accurate distance from the UAV to the target. Even though this will increase the cost of the system, we look forward to its potential applications such as target geo-location and autonomous landing. Performance improvements could also be achieved by using deep learning-based detection algorithms trained on a large number of sample images. The CMOS sensors used in this paper are low-cost, which can lead to rolling shutter effects [69]. If this error is significant, compensation should be applied to handle the issue.