Visual Servoing of a Moving Target by an Unmanned Aerial Vehicle

To track moving targets undergoing unknown translational and rotational motions, a tracking controller is developed for unmanned aerial vehicles (UAVs). The main challenges are to control both the relative position and orientation between the target and the UAV to desired values, and to guarantee that the generated control input to the UAV is feasible (i.e., within its motion capability). Moreover, the UAV is controlled so that the target always remains within the field of view of the onboard camera. These control objectives are achieved by developing a nonlinear model predictive controller in which the future motion of the target is predicted by quadratic programming (QP). Since constraints on the feature vector and the control input are considered when solving the optimal control problem, the control inputs remain bounded and the target remains inside the image. Three simulations were performed to compare the efficacy and performance of the developed controller with those of a traditional image-based visual servoing controller.


Introduction
Unmanned aerial vehicles (UAVs) have attracted much attention, since their agility makes them capable of adapting to diverse terrains and executing various tasks such as monitoring, rescue, and target tracking [1,2]. Studies of UAVs have focused on localizing a target from the sensing data of cameras [3], radars [4], and sensor networks [5-7]. Owing to the availability and low cost of cameras, many approaches reconstruct the environment from 2D image features. Algorithms have been presented for estimating the position of objects based on image features captured by stationary cameras [8] and moving cameras [9]. Methods have been developed to reconstruct the 3D model of an object from a perspective camera [10,11]. Other features, such as lines [12], cylinders and spheres [13], or planes [14,15], have also been applied to structure-from-motion (SfM) tasks. However, the aforementioned SfM methods rely on triangulation to recover depth, and extending these works to the motion estimation of moving targets remains challenging.
Stationary targets: Image-based visual servoing (IBVS) is a control method that guarantees that a set of visual features of a target converges to desired setpoints in the image [16]. However, IBVS approaches have potential problems such as producing large tracking errors or losing the track altogether, especially when the target motion is time-varying or not predicted correctly [17]. Visual predictive control (VPC) [18,19] combines model predictive control with constraints such as the field of view (FOV), the output limits of the actuators, and the workspace. A nonlinear model predictive controller was applied to an underwater vehicle to generate a desired velocity while satisfying visibility constraints [20]. Similarly, MPC has been used to control a mobile robot while tracking a stationary feature point [21], to keep a visual feature of the target at a certain position in the image [22,23], and to maximize the visibility of a target while minimizing the velocity of its image feature when utilizing quadrotors. The main contributions of this work are as follows:
1. An optimal tracking controller is developed to track a moving target undergoing unknown motion while meeting motion, control, and sensing constraints, where the relative motion between the target and the UAV is estimated.
2. In contrast to previous approaches [34-38], this work models the dynamics of the relative rotation between the target and the UAV, and the relative angle can be controlled to a predefined desired value, which is useful in applications such as the automatic searching, detection, and recognition of car license plates.
3. The controller is designed to ensure that the target remains within the FOV.
4. Compared to the IBVS controller, the developed controller ensures smoother control inputs, lower energy consumption, and smaller tracking errors.
5. The developed control architecture can be applied to track other moving targets, as long as the target can be detected by the YOLO network.
The organization of this work is as follows: Section 2 formulates the interaction between the target and the UAV as well as the control objectives. Section 3 describes how the target velocity is estimated based on the UKF and the bounding box in the image. Section 4 designs a controller to meet the control objectives. Simulations are presented in Section 5 to verify the efficacy of the developed controllers.

Preliminaries
Control Objectives
Figure 1 depicts the kinematics of a dynamic target and a camera fixed on a UAV, where the superscript G and the subscript C denote the inertial frame and the camera frame, respectively. The vector between the target and the camera can be expressed as

r_q/c = r_q − r_c, (1)

where r_q = [x_q, y_q, z_q]^T is the target position, r_c = [x_c, y_c, z_c]^T is the camera position, r_q/c = [X, Y, Z]^T is expressed in the camera frame in order to facilitate the subsequent analysis, and the component Z of r_q/c is the depth d ∈ R. Based on (1), the relative velocity ṙ_q/c can be expressed as

ṙ_q/c = V_q − V_c − ω_c × r_q/c, (2)

where V_c = [v_cx, v_cy, v_cz]^T and ω_c = [ω_cx, ω_cy, ω_cz]^T are the translational and angular velocities of the camera. In (2), V_q = [v_qx, v_qy, v_qz]^T denotes the translational velocity of the target, which is unknown and will be estimated, while V_c and ω_c are the control inputs to be designed in the subsequent sections. The orientation of the target, denoted as ψ ∈ R and shown in Figure 2, can be expressed as

ψ = ψ_q^G − ψ_c^G, (3)

where the superscript G denotes a value defined in the global frame. Taking the time derivative of both sides of (3) yields the relative angular velocity between the target and the UAV:

ψ̇ = ψ̇_q^G − ψ̇_c^G. (4)

The control objective is to achieve

[x_1, x_2]^T → [0, 0]^T, d → d_des, ψ → ψ_des, (5)

where the first condition implies that the target remains at the center of the image, and d_des ∈ R and ψ_des ∈ R are the desired depth and angle with respect to the target. The limitation of existing results is relaxed by making Assumption 1.
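As a concrete illustration of (2) and (4) as reconstructed above, the following minimal Python sketch evaluates the relative kinematics; it assumes all vectors are expressed in the camera frame and that both yaw rates are available, which is an illustrative simplification:

```python
import numpy as np

def relative_velocity(r_qc, V_q, V_c, w_c):
    """Relative translational kinematics (2): the target translates with
    V_q while the camera translates with V_c and rotates with w_c."""
    return V_q - V_c - np.cross(w_c, r_qc)

def relative_yaw_rate(psi_dot_q, psi_dot_c):
    """Relative rotational kinematics (4): difference between the target
    and camera yaw rates defined in the global frame."""
    return psi_dot_q - psi_dot_c

# example: a stationary camera observing a target moving along x
rdot = relative_velocity(np.array([0.0, 0.0, 5.0]),   # r_q/c with depth Z = 5 m
                         np.array([1.0, 0.0, 0.0]),   # V_q
                         np.zeros(3), np.zeros(3))    # V_c, w_c
```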

Kinematics Model
To model the tightly coupled motion between the camera and the target, the states of the system are defined in (6), where the state vector x = [x_1, . . . , x_n]^T collects the components related to the image feature together with the target motion. Taking the time derivative of (6) and using (2), the dynamics of the visual servoing system are obtained as the model (8), where ζ_1, ζ_2, η_1, η_2, x_c/q, y_c/q ∈ R are auxiliary terms defined accordingly.

Measurements: Bounding Box and Orientation
This section defines two measurements performed using a YOLO DNN, as depicted in Figure 3. The bounding box and the orientation measurement provide the information used to correct the states in the UKF; a detailed description of the DNN can be found in [41]. However, it is hard to define the accuracy of the orientation estimate, since the output of the DNN is classified into 24 classes. For example, as shown in the video (https://www.youtube.com/watch?v=KMQD7KzsnPE&list=PLrxYXaxBXgRqAUZyX5TsTsPaBtfSYFpHL&index=15) (accessed on 5 June 2021), the upper window demonstrates the estimation of the target angle, which is classified into 24 classes spanning 0 to 360 degrees in order to reduce the computational effort. Therefore, when the camera moved from the left to the front, the angle passed through only 6 classes, and the accuracy obtained by comparing these 6 classes with the continuous variation of the angle is not rigorously defined in the existing literature. Nevertheless, the estimate is adequate for sensing the angle of the target in our application.
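As an illustration of this discretization, a hypothetical mapping from the 24-class output to a continuous angle might look as follows; the bin-center convention is an assumption, since the paper only states that 0 to 360 degrees is split into 24 classes:

```python
NUM_CLASSES = 24                 # orientation bins used by the YOLO head
BIN_WIDTH = 360.0 / NUM_CLASSES  # 15 degrees per class

def class_to_angle_deg(class_idx: int) -> float:
    """Map a discrete orientation class to the center of its angular bin
    (bin-center convention assumed)."""
    return (class_idx + 0.5) * BIN_WIDTH
```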

Bounding Box
Based on the pinhole model, states x_1 and x_2 can be obtained as

x_1 = (ū − c_u)/f_x, x_2 = (v̄ − c_v)/f_y, (9)

where f_x and f_y are the focal lengths of the camera, [c_u, c_v] represents the center of the image frame, and ū and v̄ represent the center of the bounding box that encloses the detected moving target in the image frame, as depicted in Figure 1. The depth d is obtained from the pinhole model as

d = sqrt(f_x f_y A / a), (11)

where a is the pixel area of the bounding box and A is the ground-truth side area of the target (e.g., 4.6 m × 1.5 m for a typical sedan).

Remark 1.
The estimate of the distance to the vehicle is obtained from (11) and is not generated directly by the DNN. That is, given the generated bounding box, its pixel area a can be calculated and used to compute the depth via (11). Therefore, as long as the generated bounding box accurately encloses the target in the image and the focal lengths are accurately obtained by calibration, the estimated depth will be accurate.
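A minimal sketch of the feature extraction (9) and the depth recovery (11) as reconstructed above; the focal lengths used here are placeholder values, not calibration results from the paper:

```python
import math

def image_features(u_bar, v_bar, fx, fy, cu, cv):
    """Normalized bounding-box-center features, a sketch of (9)."""
    return (u_bar - cu) / fx, (v_bar - cv) / fy

def depth_from_bbox_area(a_pixels, A_side=4.6 * 1.5, fx=600.0, fy=600.0):
    """Depth from the bounding-box pixel area, a sketch of (11): the
    projected area scales as 1/d^2, so d = sqrt(fx * fy * A / a).
    A_side is the ground-truth side area (sedan: 4.6 m x 1.5 m)."""
    return math.sqrt(fx * fy * A_side / a_pixels)

# example: a 300 x 100 px box at these placeholder intrinsics -> ~9.1 m
d = depth_from_bbox_area(300 * 100)
```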
Based on the camera measurement, the pinhole model, the orientation of the target, and the location of the UAV, the measurement model can be derived as (12), where (1), (7), and (9) are used, ψ_m is the measurement of ψ described in Section 3.2.2, and d is the relative distance between the target and the UAV. Measurements d and ψ_m are defined in detail in Section 3.2.

Remark 2.
Changes in the target velocity can result in inaccurate state estimates, and therefore a UKF [42] is applied to address this issue. The mismatch between the real dynamics and the model defined in (8) is treated as process noise, and the states are estimated and updated in the UKF to approach the ground truth.
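A sketch of how such a UKF could be configured, here using the filterpy library. The seven-state layout (three features, relative angle, and a random-walk target velocity), the simplified process model, and the noise covariances are illustrative assumptions standing in for the paper's (8) and (12):

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

DT = 0.05  # sampling time (assumed)

def fx(x, dt):
    """Process model stand-in for (8): the unknown target velocity
    (x[4:7]) is a random walk, so model mismatch is absorbed by the
    process noise Q, as described in Remark 2."""
    x = x.copy()
    x[0:3] += dt * x[4:7]  # placeholder feature drift driven by target velocity
    return x

def hx(x):
    """Measurement stand-in for (12): bounding-box features and YOLO angle."""
    return x[[0, 1, 2, 3]]

points = MerweScaledSigmaPoints(n=7, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=7, dim_z=4, dt=DT, hx=hx, fx=fx, points=points)
ukf.Q = np.diag([1e-4] * 4 + [1e-2] * 3)  # inflated velocity noise for mismatch
ukf.R = np.diag([1e-3, 1e-3, 1e-2, 5e-2])

# per frame: ukf.predict(); ukf.update(z)  with z = [x1_m, x2_m, x3_m, psi_m]
```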

Orientation Measurement
As shown in Figure 2, ψ_m can be obtained from the YOLO network using the image captured by the onboard camera; it measures the orientation of the target as observed by the camera, so that the controller developed in Section 4 can achieve the control objective ψ → ψ_des defined in (5).

Observability
Since the system dynamics defined in (8) are nonlinear, the approach for evaluating the observability of nonlinear systems developed in (13) of [43] is applied to establish Theorem 1.

Proof. Given the observability matrix O constructed by stacking the Jacobians of the Lie derivatives of the measurement,

O = [∂L_f^0(h(x))/∂x; ∂L_f^1(h(x))/∂x; . . .],

where x_1, . . . , x_n are the elements of the state x defined in (6), h(x) is the measurement defined in (12), and L_f^i(h(x)), the ith-order Lie derivative of h(x), is defined recursively as

L_f^0(h(x)) = h(x), L_f^i(h(x)) = (∂L_f^{i−1}(h(x))/∂x) f,

where f = ẋ is the vector field. The submatrix of the observability matrix O, denoted as O_s, can be rewritten as (14). Based on (14), O_s is full rank, since rank(O_s) = 10 from rows 1-7 and rows 12-14, which implies that O is a full-rank matrix and the system is observable.
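The full-rank test in the proof can be reproduced symbolically. The sketch below uses SymPy with a toy two-state system in place of the paper's (8) and (12), purely to illustrate stacking the Jacobians of successive Lie derivatives and checking the rank:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
x = sp.Matrix([x1, x2])
f = sp.Matrix([x2, -sp.sin(x1)])  # example vector field (not the paper's (8))
h = sp.Matrix([x1])               # example measurement (not the paper's (12))

def lie_derivative(h_expr, f_expr, x_vars, order):
    """Compute the order-th Lie derivative of h along f."""
    L = h_expr
    for _ in range(order):
        L = L.jacobian(x_vars) * f_expr
    return L

# stack Jacobians of L^0_f(h) and L^1_f(h), then test the rank condition
O = sp.Matrix.vstack(*[lie_derivative(h, f, x, i).jacobian(x) for i in range(2)])
print(O, 'rank =', O.rank())  # full rank (2) => locally observable
```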

Controller Architecture
Compared to existing IBVS control methods [16], the controller developed in this work considers constraints on the visual features and the control input in order to improve the tracking performance. The control input is calculated by minimizing a cost function, designed in this section, that considers not only the error of the feature vector but also the velocity of the moving target. Using this controller leads to less-aggressive flying behavior and also decreases energy consumption, given appropriate gain matrices in the cost function.

Controller Design
The dynamics model defined in (8) is used for the process step in the UKF so that the target state can be estimated. The dynamics model of s_m(t) = [x_1, x_2, x_3, ψ]^T ∈ R^4 used for prediction in the controller can be written as (15), where the control input reduces to u = [v_cx, v_cy, v_cz, ω_cy]^T ∈ R^4. To deal with the modeling errors associated with the use of the dynamics model and the imperfect actuation of the UAV, an error signal ε_k is designed as

ε_k = s_k − s_m,k, (16)

where s_k is estimated using the UKF and s_m,k is predicted using the model (15) at time k.
The desired feature vector utilized in the controller can be defined as

s_d,k+i/k = s* + ε_k, i = 0, . . . , N_p, (17)

where s* = [x*_1, x*_2, x*_3, ψ*]^T ∈ R^4 is the reference feature vector prescribed by the user. With (15) and (17), the optimal control problem (OCP) of the controller can be expressed as

min_{S_m, U} Σ_{i=0}^{N_p−1} ( ‖s_m,k+i/k − s_d,k+i/k‖²_{Q_s} + ‖u_{k+i/k} − u_d,k+i/k‖²_{R_u} ) + ‖s_m,k+N_p/k − s_d,k+N_p/k‖²_{W_s}
s.t. s_m,k+i+1/k = F_m(s_m,k+i/k, u_{k+i/k}), s_m,k+i/k ∈ S, u_{k+i/k} ∈ U, (18)

where u_d ∈ R^4 is the desired velocity that the control input u needs to achieve; its translational part is updated by Γ̂_i over the prediction horizon, where Γ̂_i(t) ∈ R^3 is an nth-order polynomial obtained by the QP and defined on the interval [t_{i−1}, t_i) as

Γ̂_i(t) = Σ_{j=0}^{n} c_ij t^j, (19)

where c_ij ∈ R^3 denotes the jth-order coefficient. S_m = {s_m,k/k, s_m,k+1/k, . . . , s_m,k+N_p/k} denotes the features predicted at sampling time k based on the control sequence U = {u_k/k, u_k+1/k, . . . , u_k+N_p−1/k} and the estimate s_k obtained using the UKF. Q_s, R_u, and W_s in (18) are positive-definite weighting matrices, and N_p ∈ N is the prediction horizon; increasing N_p results in less-aggressive control inputs but also increases the computational effort. F_m(·) is the model defined in (15). Since only the linear velocity of the target can be estimated (i.e., Γ̂_i), the control input ω_cy depends only on the cost of the feature errors s_m − s_d. S and U are the sets of constraints on the feature vector and the control input, given by

S = {s_m | s_min ≤ s_m ≤ s_max}, U = {u | u_min ≤ u ≤ u_max}. (20)

When solving the OCP at time k, ε_k is calculated by (16) using the feature vector estimated by the UKF at time k and the feature vector predicted by the model (15) at time k − 1. Since this error signal is assumed to be constant over the prediction horizon, s_d,k+i/k, i = 0, . . . , N_p, is also constant over the prediction horizon.
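Since the QP behind (19) reduces to ordinary least squares when no inequality constraints are imposed, the target-velocity polynomial can be fitted per axis from a window of UKF velocity estimates, as in this sketch; the unconstrained formulation, window contents, and polynomial order are assumptions:

```python
import numpy as np

def fit_velocity_polynomial(t_samples, v_samples, order=3):
    """Least-squares fit of an nth-order polynomial per velocity axis,
    a sketch of the QP behind (19). v_samples has shape (M, 3); the
    returned coefficients have shape (order + 1, 3), highest order first."""
    return np.polyfit(t_samples, v_samples, order)

def predict_velocity(coeffs, t_future):
    """Evaluate the fitted polynomials to obtain the translational part
    of u_d over the prediction horizon."""
    return np.column_stack([np.polyval(coeffs[:, k], t_future)
                            for k in range(coeffs.shape[1])])
```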
The direct multiple-shooting method is employed to solve the problem, since "lifting" the OCP to a higher dimension usually speeds up the rate of convergence. The OCP is solved again at every sampling time, with only the first vector u_k/k of the control sequence U being applied to the system. Several software libraries can be used to solve this kind of nonlinear programming problem (NLP); this study used CasADi to formulate the NLP and Ipopt to solve it.
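A compact sketch of how the OCP (18) can be transcribed with direct multiple shooting in CasADi and solved with Ipopt, as the text describes; the placeholder model, horizon, sampling time, and scalar weights are assumptions standing in for (15) and the matrices Q_s, R_u, and W_s:

```python
import casadi as ca

Np, nx, nu = 10, 4, 4        # horizon and dimensions of s_m and u (assumed)
Qs, Ru, Ws = 1.0, 0.1, 5.0   # scalar stand-ins for Q_s, R_u, W_s
dt = 0.1                     # sampling time (assumed)

x = ca.MX.sym('x', nx)
u = ca.MX.sym('u', nu)
# placeholder discrete-time model; the paper's F_m from (15) would go here
Fm = ca.Function('Fm', [x, u], [x + dt * u])

X = ca.MX.sym('X', nx, Np + 1)  # multiple-shooting state trajectory
U = ca.MX.sym('U', nu, Np)      # control sequence
s0 = ca.MX.sym('s0', nx)        # UKF estimate s_k at time k
sd = ca.MX.sym('sd', nx)        # desired feature vector from (17)
ud = ca.MX.sym('ud', nu)        # desired velocity from the fitted polynomial

cost = Ws * ca.sumsqr(X[:, Np] - sd)  # terminal cost
g = [X[:, 0] - s0]                    # initial-condition constraint
for i in range(Np):
    cost += Qs * ca.sumsqr(X[:, i] - sd) + Ru * ca.sumsqr(U[:, i] - ud)
    g.append(X[:, i + 1] - Fm(X[:, i], U[:, i]))  # shooting (defect) constraints

nlp = {'x': ca.vertcat(ca.vec(X), ca.vec(U)),
       'p': ca.vertcat(s0, sd, ud),
       'f': cost,
       'g': ca.vertcat(*g)}
solver = ca.nlpsol('solver', 'ipopt', nlp)
# at run time: pass lbg = ubg = 0 for the defects, encode the box sets S and U
# from (20) through lbx/ubx, and apply only the first input u_k/k.
```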

Remark 3. The control architecture depicted in Figure 4 can be applied to general target tracking. That is, the feature vector and orientation generated by YOLO can be used to predict the motion v_q of the target, which can improve the tracking performance and, in turn, enhance the detection of bounding boxes.
The control goal in the tracking task is to keep the moving target within the FOV of the camera at the desired depth and angle relative to the target. The corresponding reference feature vector is given in (21) and substituted into (17), which defines the desired feature vector, as shown in Figure 6. (Figure 6. The red dot inside the bounding box represents the reference feature vector defined in (21), as prescribed by the user in the simulations.)
In order to maintain visibility and address the limitations of the actuators while tracking the moving target, constraints on the states and on the control inputs were considered in the numerical simulations, as listed in Tables 1 and 2. The parameters for the sampling time, prediction horizon, and weighting matrices in the cost function are presented in Table 3.

(Table 1. Constraints on the states: minimum and maximum values.)

Simulation Results
To demonstrate the practicality and efficiency of the developed controller, three simulations are described that compare it with a traditional IBVS controller. In the first simulation, the UAV tracked a target whose relative angle changed over time, in order to show the benefit of including the angle dynamics in the dynamics model. In the second simulation, the UAV tracked a target moving along a z-shaped path, in order to compare the tracking performance of different controller designs under aggressive motion. In the third simulation, the IBVS controller and the developed controller were compared to demonstrate tracking efficiency under the same aggressive-motion condition.

Simulation 1: Controller with Relative Rotational Dynamics
The scenario of Simulation 1 is depicted in Figure 7. The controller designed in Section 4 was implemented to track a target at a time-varying relative angle. In Case 1, the rotational dynamics were not considered in the dynamics model, and the relative orientation was measured directly using the YOLO network. In Case 2, the angle dynamics were taken into consideration in the UKF, which made the relative orientation estimate more robust to noisy or intermittent measurements from the YOLO network. The performances are quantified as the root-mean-square (RMS) tracking errors defined in Table 4. Figures 11 and 12 show the velocity control inputs for Case 1 and Case 2, respectively, with the upper and lower bounds on the control inputs defined in Table 2 highlighted. Figure 13 compares the total control inputs, which were lower in Case 2 than in Case 1. The total control input is defined as the summation of u_{k+i/k}, solved from the OCP in (18), over the simulation period (i.e., 50 s); the higher the value, the higher the energy consumption, since the energy consumption of the UAV is proportional to the cube of its speed u_{k+i/k} defined below (15), based on [44].
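For reference, the RMS tracking error and the total-control-input metric described above can be computed as in this sketch; the array layouts are assumptions, and the cubic energy proxy is an optional variant motivated by [44]:

```python
import numpy as np

def rms_error(err_history):
    """Root-mean-square tracking error per component, as in Tables 4-6."""
    err = np.asarray(err_history)             # shape (T, n_err)
    return np.sqrt(np.mean(err ** 2, axis=0))

def total_control_input(u_history):
    """Total control input: summation of the applied u_{k+i/k} magnitudes
    over the simulation period (50 s), as plotted in Figure 13."""
    u = np.asarray(u_history)                 # shape (T, 4)
    return np.sum(np.linalg.norm(u, axis=1))

def energy_proxy(u_history):
    """Optional energy proxy: translational speed cubed, following the
    cubic relation cited from [44]."""
    u = np.asarray(u_history)
    return np.sum(np.linalg.norm(u[:, :3], axis=1) ** 3)
```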

Simulation 2: Controller with Target Motion Pattern
The scenario of Simulation 2 is depicted in Figure 14, where the UAV is controlled to track a target moving along a z-shaped path. The controllers for Cases 1 and 2 are compared. In Case 1, the difference between consecutive control inputs to the UAV (i.e., a differential control cost) is penalized in the controller. In Case 2, by leveraging the motion pattern estimated using the UKF, the difference between the velocities of the UAV and the target is considered in the controller designed in Section 4. The performances are quantified by the RMS tracking errors defined in Table 5. Figure 17 compares the depth and relative-angle tracking errors in Cases 1 and 2, and shows that Case 2 performed better than Case 1: thanks to the motion pattern estimated using the UKF, the controller developed in Section 4 outperformed the traditional optimal controller with a differential control cost. Figures 18 and 19 show the velocity control inputs for Cases 1 and 2, respectively, with the upper and lower bounds on the control inputs defined in Table 2 highlighted. The control inputs were much more stable and smoother in Case 2 than in Case 1.

Simulation 3: IBVS vs. the Developed Controller
The scenario in this simulation is the same as that depicted in Figure 14, but the performance of the IBVS controller is compared with that of the controller designed in Section 4 while tracking a target moving along a z-shaped path (Cases 1 and 2, respectively). Figure 22 compares the state-vector tracking errors in Cases 1 and 2, and the performances quantified by the RMS tracking errors are presented in Table 6, from which the relative stability of the two controllers is also evident. Figure 23 shows the velocity control inputs of the IBVS controller, with the upper and lower bounds defined in Table 2 highlighted: the linear and angular velocity control inputs keep fluctuating during tracking. In contrast, Figure 19 shows that the linear and angular velocity control inputs of the developed controller are stable and smooth. Figure 24 shows that, compared with the IBVS controller, the developed controller needs much less energy to track the moving target. In other words, the developed controller is much more efficient than the IBVS controller, since it saves energy by not generating unnecessary motion.

IBVS Controller vs. The Developed Controller
The IBVS controller relies on the image-feature error of the target for feedback, which causes the motion of the UAV to become unstable and oscillatory when the error is large (i.e., large displacement and rotation), as shown in Figure 24. In contrast, the developed controller minimizes a cost function that considers not only the current feature error but also the future states of the image features over the prediction horizon, according to the dynamics model. Additionally, the constraints on the control inputs are taken into consideration in order to avoid excessive and unreasonable motions. Figure 23 shows that the angular velocity of the IBVS controller exceeds the control-input limits defined in Table 2, whereas the angular velocity generated by the developed controller in Figure 19 remains within them, preventing unreasonable motion.

Conclusions
A tracking controller has been developed to track a moving target; it relies on bounding-box features generated by a YOLO network and a target motion pattern obtained by QP, with the target states estimated using a UKF. The developed controller ensures the following:

1. The control effort is less than that of traditional controllers, while the tracking error, quantified by the RMS error, is also smaller than that of a traditional controller.
2. The control input generated by the controller always remains within the UAV's motion capabilities.
3. The target can be tracked, and its image features always remain within the image, which is not guaranteed for a traditional IBVS controller.
All of the above results were verified in three simulations. The first simulation illustrated that the controller, which considers the relative rotational dynamics, performs better when tracking a target at a specified angle. The second simulation indicated that using the predicted motion pattern of the target in the controller improves the tracking performance compared to a traditional optimal controller, while also requiring less control effort. The third simulation compared the developed controller with a traditional IBVS controller and showed that the developed controller requires much less control effort and achieves a smaller tracking error. This work has demonstrated both the control efficacy and the high performance of the developed controller.