Tracking a Non-Cooperative Target Using Real-Time Stereovision-Based Control: An Experimental Study

Tracking a non-cooperative target is challenging because, in unfamiliar environments, most targets are unknown and unspecified. Stereovision is well suited to this problem, because it allows large areas to be scanned passively while estimating the relative position, velocity and shape of objects. This research is an experimental effort aimed at developing, implementing and evaluating real-time non-cooperative target tracking methods using stereovision measurements only. A computer-vision feature detection and matching algorithm was developed to identify and locate the target in the captured images. Three different filters were designed for estimating the relative position and velocity, and their performance was compared. A line-of-sight control algorithm was used to keep the target within the field-of-view. Extensive analytical and numerical investigations were conducted on the multi-view stereo projection equations and their solutions, which were used to initialize the different filters. Using an experimental and numerical evaluation, this research shows the benefits of the unscented Kalman filter and the total least squares technique in the stereovision-based tracking problem. These findings offer a general and more accurate method for solving the static and dynamic stereovision triangulation problems and the concomitant line-of-sight control.


Cooperative vs. Non-Cooperative Targets
An important problem in the field of computer vision is estimating the relative pose. Numerous past studies addressed this problem, dealing with objects on which some information is a priori available; these objects are referred to as cooperative targets.
To estimate the relative pose with respect to cooperative targets, Fasano, Grassi and Accardo [1] used light emitting diodes, while Woffinden and Geller [2] used known shapes as features, placed at known positions on the target. Terui, Kamimura and Nishida [3] used a known 3D model of the object, a model matching technique, and 3D feature points obtained from stereo matching.
The problem of estimating the position, velocity, attitude and structure of a non-cooperative target significantly surpasses the cooperative problem in complexity. Segal, Carmi and Gurfil [4,5] used two cameras for finding and matching a priori unknown feature points on a target. These feature points were fed to an Iterated Extended Kalman Filter (IEKF) [6], which was based on the dynamics of spacecraft relative motion.
Also, the target's inertia tensor was estimated using a maximum a posteriori identification scheme. Lichter and Dubowsky [7,8] used several vision sensors, which were distributed fairly uniformly.
The CMKF is less known than the EKF and UKF. The main idea of the CMKF is to convert the measurement equation into a linear equation by algebraic manipulation. The noise models in the converted equations are not necessarily identical to the noise model in the original equations, and, as a result, a biased estimation might occur. The CMKF was first developed by Lerro and Bar-Shalom [23] for tracking systems that measure the position of targets in polar coordinates. They also added bias compensation components, in a variant referred to as the "Debiased CMKF". Later on, Suchomski [24] proposed a three-dimensional version of this filter. A second version of the CMKF, referred to as the "Unbiased CMKF", was proposed by Xiaoquan, Yiyu and Bar-Shalom [25]; it included different bias compensation components. It was later developed further using a different bias compensation technique [26].
An important issue with implications on the estimation process is the Depth Error Equation (DEE), which is an analytical equation that yields an approximation of the Line-of-Sight (LOS) vector's depth component. The DEE can be expanded to all of the LOS components, and is referred to as Relative Position Error Equations (RPEE). The RPEE yields an analytical approximation of the expected errors of the relative position components. Gallup, Frahm, Mordohai and Pollefeys [27] investigated the DEE with changing baseline and resolution in order to compute a depth map over a certain volume with a constant depth error. Oh et al. [28] evaluated the DEE via experiments.
A direct solution of the stereo measurement model is referred to as Converted Measurements (CMS). The CMS are used for initializing the estimators, and as a reliability check for each measurement. That is, if the CMS and the state estimate are too dissimilar, then the measurement's validity should be questioned. In the case of stereovision with two horizontally positioned and vertically aligned cameras, the measurement model comprises four algebraic, coupled, overdetermined and non-linear equations. By rearranging these equations, the model's non-linearity can be eliminated at the cost of certain numerical issues. In this case, the CMS can be obtained by various methods, such as Least Squares (LS) [6] and Total Least Squares (TLS) [29].

Research Contributions
The current research is an experimental effort in which non-cooperative target tracking methods using stereovision measurements are developed. Real-time closed-loop line-of-sight control based on stereovision measurements only is developed and evaluated. In particular, our main contribution is the development of a stereovision tracking algorithm for non-cooperative targets suitable for real-time implementation. Moreover, we examine different estimation schemes and evaluate their real-time performance. Emphasis is placed on devising efficient methods for initializing the relative position estimators based on stereovision measurements.
The remainder of this paper is organized as follows. Section 2 introduces the computer vision models and algorithms used throughout this study. In Section 3, the estimation models are introduced, including the CMKF, which is a novel estimator. Section 4 describes the control algorithm, which is implemented in the experiments. In Section 5 a numerical analysis of the problem described in Section 3.4 is conducted. Section 6 describes the Distributed Space Systems Laboratory (DSSL) hardware, the experimental system, and three experiments, which implement the algorithms developed in this research. Section 7 contains the conclusions of this work.
Notice that Sections 2, 3 and 5 address a 3D problem, while Section 4 addresses a 2D problem. This is because the estimation algorithms are designed to be general, while the control algorithm must be adjusted to the 2D setup of the robots in the DSSL. As a result, the control algorithm uses only the relevant 2D components of the estimated 3D quantities.

Pinhole Model
Consider N cameras, separated by translation only and with no relative rotation, positioned at N different fixed points in three-dimensional space, facing the same direction, and sharing a common image plane. A reference coordinate system is denoted by {x_0, y_0, z_0}. The camera-fixed coordinates are denoted by {x_i, y_i, z_i}. The reference coordinate system and the camera-fixed coordinate systems rotate together and face the same direction at all times, so no transformation is needed between them. The vectors b_i (i = 1, ..., N) connect the reference coordinate system with the N camera-fixed coordinate origins.
The vector ρ connects {x_0, y_0, z_0} with an arbitrary point in space, A. Assuming A appears in the Field of View (FOV) of all the cameras, the vectors h_i (i = 1, ..., N) connect each camera's respective coordinate origin with point A. Without loss of generality, it is assumed that all the cameras face the y direction of their respective coordinate systems. This geometry is depicted in Figure 1a.

Image Resizing
Consider an image with dimensions X_1 and Y_1 and focal lengths f_x^base, f_z^base in the x and y directions, respectively. Assume that it is desired to resize this image to the dimensions X_2 and Y_2. To that end, the Resize Factor R_F is defined. For simplicity, we assume that the image is resized while keeping the aspect ratio constant, so that

R_F = X_2 / X_1 = Y_2 / Y_1.

When images are resized, their focal lengths should be corrected correspondingly,

f_x = R_F f_x^base,  f_y = R_F f_z^base,

where f_x and f_y are the focal lengths of the resized image.
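This scaling can be sketched in a few lines; the function name and argument layout are illustrative, not from the paper:

```python
def resized_focal_lengths(x1, x2, fx_base, fz_base):
    """Scale focal lengths after resizing an image of width x1 to width x2,
    assuming the aspect ratio is kept constant (hypothetical helper)."""
    rf = x2 / x1            # resize factor R_F = X2/X1 = Y2/Y1
    fx = rf * fx_base       # focal lengths scale linearly with image size
    fy = rf * fz_base
    return rf, fx, fy
```

For example, halving a 1280-pixel-wide image gives R_F = 0.5 and halves both focal lengths.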

Relative Position Measurement Error Approximation
The error of the measured relative position ρ_m, for a simple case of two horizontally positioned and vertically aligned cameras, is approximated in the following manner. It is assumed that the focal lengths of the cameras are equal. In Figure 2, which depicts the geometry of aligned cameras, it can be seen that the depth component satisfies

ρ_y = f b / d,  d = z_x1 - z_x2,

where b is the distance between the cameras, also referred to as the baseline, and d is the disparity [17,21]. The linear projection equations (6) specialize accordingly for this case.

Figure 2. A two-dimensional geometry of two aligned cameras and a point's projection on their respective image planes.
Equations (11) and (12) are referred to as the Horizontal Equations, while Equations (13) and (14) are referred to as the Vertical Equations. These equations can be solved using the Combined Vertical Equations (CVE) method, developed in this research, which addresses the specific problem of stereovision with two horizontally positioned and vertically aligned cameras. Under this approach, it is noticed that the vertical equations in both cameras are identical up to measurement errors. Therefore, to render the linear system determined, the vertical equations are averaged into one equation. Consequently, a solution is obtained by simply inverting the matrix on the left-hand side of the resulting system, which becomes a 3 × 3 matrix, yielding the CVE solutions for ρ_x, ρ_y, ρ_z.

The horizontal error, depth error and vertical error, denoted by Δρ_x, Δρ_y, Δρ_z, respectively, are defined in terms of Δz_x1, Δz_x2, Δz_z1, Δz_z2, the errors in z_x1, z_x2, z_z1, z_z2, respectively. By combining Equations (15)-(17), expressions for the expected errors are obtained. Finally, the relative position vector is approximated as

ρ = ρ_m + Δρ,

where ρ_m is the converted measurement, calculated using Equations (11)-(14), and Δρ is the additive error vector.
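The disparity-to-depth relation, and the quadratic growth of the depth error with range that underlies the DEE discussed in Section 1, can be sketched as follows; the helper names are hypothetical:

```python
def stereo_depth(f_px, baseline, zx1, zx2):
    """Depth from horizontal disparity for two aligned cameras (sketch).
    f_px: focal length [px]; baseline: camera separation [m];
    zx1, zx2: horizontal image coordinates of the point in each camera [px]."""
    d = zx1 - zx2                  # disparity
    return f_px * baseline / d     # rho_y = f * b / d

def depth_error(f_px, baseline, depth, d_pix):
    """First-order depth-error (DEE) approximation: for a fixed pixel error
    d_pix, the depth error grows quadratically with depth."""
    return depth ** 2 * d_pix / (f_px * baseline)
```

The second function makes the baseline trade-off explicit: a larger baseline reduces the depth error at a given range.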

Computer Vision Algorithms
To utilize the stereovision camera as a sensor, the target has to be identified in the scene, following the extraction of its position in the image plane in all cameras. To that end, a methodology which relies on feature detection and matching, image segmentation and target recognition algorithms is used.

Feature Detection and Matching
Feature detection and matching [17] is used for matching the target in all cameras, by detecting feature points in the images and assigning to each point a descriptor with encoded visual information. Consequently, it is possible to find the same point in different images by matching the descriptors numerically.
Over the years, different kinds of feature point detection methods were developed. The most well known is SIFT [18], and there are others such as SURF [19], Maximally Stable Extremal Regions (MSER) [30], Features from Accelerated Segment Test (FAST) [31] and Oriented fast and Rotated Brief (ORB) [32]. Each method relies on a different geometric concept, and has different properties, such as computational effort and reliability.
In this research, SURF was used. The computational effort of SURF is significantly smaller than that of SIFT, which was used by Jigalin and Gurfil [10]. The feature matching methods used herein are:
1. Nearest Neighbour: The descriptors of the feature points of all the cameras are matched by comparing them using Euclidean distances. The reliability of the matches is increased by eliminating ambiguous matches, defined as matches that have a low ratio between their distance and the distance of the second-closest neighbor (see [18]).
2. Fundamental Matrix Estimation Using RANSAC [21]: This method eliminates the outliers of the Nearest Neighbour method by applying geometric constraints.
3. Slope Cut-Off: This method eliminates more outliers by enforcing geometric consistency. It places the stereovision images side by side and stretches lines between all the matched points. For every object, all the lines should have a similar slope, so the lines should not cut each other. The matches of lines that do not adhere to these rules are declared outliers and rejected.
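The ratio test of the Nearest Neighbour step can be sketched directly on descriptor arrays; this is a minimal NumPy version assuming SURF descriptors are given as rows, and the 0.7 threshold is illustrative:

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.7):
    """Lowe's ratio test (sketch): match each descriptor in desc1 to its
    nearest neighbour in desc2, keeping only unambiguous matches whose
    nearest distance is well below the second-nearest distance."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:     # reject ambiguous matches
            matches.append((i, int(best)))
    return matches
```

In practice the same test is available through OpenCV's k-nearest-neighbour matchers; the NumPy form is shown only to make the ambiguity criterion explicit.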

Image Segmentation
The following image segmentation method [10] is used for locating all the target candidates in the image plane. A target candidate's projection on the image plane is assumed to be a notable blob. With this assumption, the images from all the cameras are subjected to the following procedures:
1. The images are processed by a Sobel edge detector [17], which returns binary images of the edges in the scene.
2. The binary images are filtered in order to clean noisy pixels.
3. The images are processed by an image dilation operator, which expands the outlines of the objects in the scene. This operation closes all the gaps in the objects' outlines.
4. After the gaps in the objects' outlines are closed, the objects, which are closed outlined blobs, are filled.
5. Finally, the images are processed by an erosion operator, which diminishes all the blobs in the images. This operation leaves only the most significant blobs.
All the remaining separate blobs are considered as target candidates.
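The five steps above can be sketched with SciPy's morphology operators; the threshold and iteration counts are illustrative assumptions, not the paper's values:

```python
import numpy as np
from scipy import ndimage

def segment_blobs(gray, edge_thresh=0.5, dilate_iters=2, erode_iters=2):
    """Blob segmentation sketch mirroring the five steps above."""
    sx = ndimage.sobel(gray, axis=0)                 # 1. Sobel edges ...
    sy = ndimage.sobel(gray, axis=1)
    edges = np.hypot(sx, sy) > edge_thresh           # ... as a binary image
    # 2. clean noisy pixels with a small median filter
    edges = ndimage.median_filter(edges.astype(np.uint8), size=3) > 0
    # 3. dilate to close gaps in the objects' outlines
    edges = ndimage.binary_dilation(edges, iterations=dilate_iters)
    filled = ndimage.binary_fill_holes(edges)        # 4. fill closed outlines
    # 5. erode, leaving only the most significant blobs
    blobs = ndimage.binary_erosion(filled, iterations=erode_iters)
    labels, n = ndimage.label(blobs)                 # blobs = target candidates
    return labels, n
```

Each connected component in `labels` then corresponds to one target candidate.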

Target Recognition
The next step is to determine which of the obtained target candidates is the real one. In order to do so, the following method is used. The method starts by calculating the area, solidity and distance of each target candidate's region. Area is the number of pixels in each region. Solidity is the area fraction of the region with respect to its convex hull (0 < Solidity < 1). Distance is the mean distance of SURF feature points in each region.
The distance value for each feature point is simply the y component of the LOS vector ρ to that feature point. These vectors are obtained by solving Equation (6) for each feature point. An important issue is the method used to solve Equation (6); among others, it can be solved using LS or TLS, which were mentioned above. A comparison is given in Section 5.
The properties (Area, Solidity and Distance) of each target candidate have to be bounded within certain upper and lower bounds. Each target candidate whose properties do not lie within these bounds is discarded. Among the target candidates that meet the previous criteria, the true target is selected using a scoring relationship, where N_c is the number of target candidates. This formula gives preference to larger, closer and more convex objects using a non-dimensional value.

Process Model
We define I as a Cartesian right-hand inertial reference frame. The state vector x contains the relative position and velocity between the chaser and the target in the inertial frame I. To write a process model, it is assumed that the target's mass, inertia, applied forces and torques are unknown; therefore, an exact dynamical model describing the relative state between the chaser and the target is quite complicated to derive. Instead, a white-noise acceleration model [6] can be implemented. This model is less accurate, but it does not require much information about the target. The continuous process model is written as

ẋ(t) = A x(t) + ν(t),

where ν(t) ∈ R^(6×1) is a white Gaussian process noise vector satisfying E[ν(t) ν(τ)^T] = Q_c δ(t - τ), with Q_c being the power spectral density. The next step is to write the discrete-time process model with a sampling period Δt,

x_(k+1) = Φ x_k + ν_k,

where Φ is the discrete state transition matrix. The discrete process noise vector ν_k satisfies E[ν_k ν_j^T] = Q δ_kj, where Q is the covariance matrix.
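A minimal construction of the discrete white-noise acceleration model, assuming the standard discretization of the continuous model (the helper name is hypothetical):

```python
import numpy as np

def discrete_process_model(dt, q_diag):
    """Discrete white-noise acceleration model (sketch). The state is
    [relative position; relative velocity] in R^6; q_diag holds the
    power-spectral-density values of the three acceleration axes."""
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    # state transition: position integrates velocity over one step
    Phi = np.block([[I3, dt * I3],
                    [Z3, I3]])
    # standard discretized covariance of a white-noise acceleration model
    Qc = np.diag(q_diag)
    Q = np.block([[Qc * dt**3 / 3, Qc * dt**2 / 2],
                  [Qc * dt**2 / 2, Qc * dt]])
    return Phi, Q
```

Note that Q couples position and velocity noise through the Δt² and Δt³ terms, so the process noise is not simply diagonal.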

Measurement Model
The non-linear projection equations (Equation (5)) for N cameras relate the state to the measurement vector z, which denotes the coordinates in pixels of the Center of Projection (COP) of the target relative to the center of the images. f_x1, f_z1, ..., f_xN, f_zN are the focal lengths in pixels for each camera and each direction separately. w is a zero-mean Gaussian measurement noise vector, and R is its covariance matrix.

CMKF
The CMKF [23][24][25][26] is a less common filter than the EKF and the UKF, and it cannot be implemented on all non-linear systems; only the measurement equations are allowed to be non-linear (the process equations are compelled to be linear) and not all non-linearities can be dealt with. The main idea in the CMKF is to rearrange the non-linear measurement equations into linear equations.
The CMKF is a linear filter, and therefore has a stability proof, which leads to robustness to different initial conditions and to large time increments. On the other hand, the noise in the rearranged measurement equations is not necessarily white, and it may be difficult to determine its statistics. An inaccurate assessment of the noise statistics can result in a biased estimation of the state.
The CMKF fits quite well to the case of relative position estimation using stereovision with no a-priori information. The process equation is linear (Equation (26)) and the non-linearity in the measurement equation (Equation (5)) can be dealt with (Equation (6)).
Equation (6) shows that the converted measurements satisfy a linear system, A ρ_m = b. The unknown vector ρ_m can be evaluated by solving Equation (33) using different methods, as discussed in Section 5.
Although the vector ρ_m does not necessarily contain additive noise, it was approximated in Section 2.3 using a linear model. Equation (34) is the CMKF measurement equation, where w is a zero-mean measurement noise vector and R is its covariance matrix, with diagonal entries σ_ρx², σ_ρy², σ_ρz². Here σ_ρx, σ_ρy, σ_ρz are approximated using Equation (18), and Δz is the standard deviation of the coordinates of the centroid of the projection of the target onto the image plane, assumed to be the same in the x and z directions in both cameras. The rest of the CMKF equations are standard Kalman Filter (KF) equations. The full details of the CMKF algorithm are described in Algorithm 1, where P, Q, R are the state covariance, process noise covariance and measurement noise covariance, respectively, Φ is the discrete state-transition matrix, K is the gain matrix, k is the time step index and ρ_m is an estimate of the LOS vector ρ.
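Once the measurement has been converted to linear form, each CMKF cycle reduces to the standard KF recursion, sketched below as a generic textbook implementation (not the paper's Algorithm 1 verbatim):

```python
import numpy as np

def kf_step(x, P, z, Phi, Q, H, R):
    """One standard Kalman-filter cycle, as used inside the CMKF once the
    measurement is in the linear form z = H x + w (generic sketch)."""
    # predict
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    # update
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

In the CMKF, R is recomputed at each step from the error model of Equation (18) rather than held constant.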

Filters Initialization
By using the methods described in Section 2.4, the obtained measurements are processed in order to initialize the filters through Equation (6), which can be solved using several methods, e.g., LS, TLS and CVE. These methods solve the overdetermined system of linear equations A x = b, where A is a matrix with more rows than columns. In the LS approach [6], there is an underlying assumption that all the errors are confined to the observation vector b; its solution is x = (A^T A)^(-1) A^T b. In the TLS approach [29], it is assumed that there are errors both in b and in A. The CVE approach was presented in Section 2.3. In order to determine the preferred method of solution, the performance of these methods will be examined in Section 5.
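The LS and TLS solutions of the overdetermined system A x = b can be sketched as follows; the TLS solution uses the classical SVD construction, and the function names are illustrative:

```python
import numpy as np

def solve_ls(A, b):
    """Ordinary least squares: errors assumed confined to b."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

def solve_tls(A, b):
    """Total least squares via SVD: errors assumed in both A and b.
    Classical solution from the right singular vector associated with the
    smallest singular value of the augmented matrix [A | b]."""
    n = A.shape[1]
    C = np.hstack([A, b.reshape(-1, 1)])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]
    return -v[:n] / v[n]
```

On noise-free data both solvers recover the same vector; they differ once A itself is perturbed, which is exactly the situation in the converted stereo equations.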

Estimation Scheme
The estimation scheme in Figure 3 describes the implemented algorithm. At every step, a measurement is acquired. If a target has not been found yet, or the algorithm was initialized in the previous step, then the target recognition algorithm is activated. Only if a target was found by the target recognition algorithm are the different filters initialized. After each filtering step, the algorithm checks whether the estimate is reliable. An estimate is declared non-reliable if the produced state is not reasonable in relation to the laboratory dimensions or to the state of the previous step. A non-reliable state can also be produced when there have been too few matched feature points for several sequential steps, which happens mostly if the target has left the scene. If the state is unreliable, the algorithm is re-initialized, meaning that the filters are initialized and the target has to be found again.

Figure 3. The estimation logic. This scheme consists of a filter re-initialization step, which is activated in case the estimation is no longer reliable.

Target Tracking
The LOS between the chaser and the target is the vector ρ, defined in previous sections. It determines whether the target is visible to the chaser.
The direction of the LOS can be expressed using azimuth and elevation angles. In general, the tracking algorithm has to keep these two angles bounded. In the two-dimensional case, only the azimuth angle has to be controlled. The bounds of these angles are determined by the camera rig's FOV, which in turn is determined by the cameras' properties and the geometric arrangement. Figure 4 depicts two vertically aligned and horizontally positioned cameras and their combined FOV, denoted by StereoView. α is the horizontal Angle-of-View (AOV), baseline is the distance between the cameras, and Range is the distance between the cameras and the target. From this geometry, the relation between StereoView, Range, α and the baseline given in Equation (37) can be inferred.

FOV Azimuth Angle Bounds
It can be seen that the use of stereovision diminishes the combined FOV. More specifically, enlarging the baseline leads to a smaller StereoView. On the other hand, increasing the baseline also leads to higher accuracy (Equation (18)). By substituting StereoView = 0 into Equation (37), the minimal Range is calculated.
The horizontal AOV of each camera is expressed as α = 2 arctan(SensorWidth / (2 f)), where f is the focal length, f_base and R_F are the basic focal length and resize factor, respectively, as defined in Section 2.2, and SensorWidth is the width of the optical sensor.
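Since the exact form of Equation (37) is not reproduced in this text, the overlap formula below is an assumption consistent with the surrounding discussion (overlap grows with range, shrinks with baseline); the FOV quantities can then be sketched as:

```python
import math

def horizontal_aov(sensor_width, f):
    """Horizontal angle of view of one camera [rad]."""
    return 2.0 * math.atan(sensor_width / (2.0 * f))

def stereo_view(rng, aov, baseline):
    """Width of the overlapping FOV at a given range: growing the range
    enlarges the overlap, growing the baseline shrinks it (assumed form)."""
    return 2.0 * rng * math.tan(aov / 2.0) - baseline

def range_min(aov, baseline):
    """Minimal range at which the two FOVs start to overlap (StereoView = 0)."""
    return baseline / (2.0 * math.tan(aov / 2.0))
```

Setting `stereo_view` to zero and solving for the range reproduces `range_min`, matching the substitution described above.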
In order to maintain the target within StereoView, φ_max is defined as the maximal allowed azimuth angle for a certain Range; it is calculated from the geometry depicted in Figure 5.

Figure 6 depicts the chaser, the target and the LOS vector ρ. x_c^I and x_t^I are the chaser's and target's inertial position vectors, respectively. θ_c is the chaser's body angle. C is a Cartesian right-hand body-fixed reference frame attached to the chaser. x̂_C, ŷ_C, x̂_I and ŷ_I are the principal axes of the chaser's body frame and the inertial frame, respectively. The camera rig is assumed to be installed on the chaser along the ŷ_C direction. φ is the azimuth angle, calculated from these quantities (Equation (41)).

Figure 7 depicts the LOS control algorithm. In the tracking algorithm, it is desired to maintain the target in the FOV center, or, in other words, to minimize |φ|. This is done by a Proportional-Derivative (PD) controller. The signals x_c^I and x_t^I are constantly changing and, therefore, φ will not converge to zero. In Section 3.5 it was mentioned that there are cases where the state is unreliable and the estimation algorithm is re-initialized. In such cases, there is no estimate of the vector ρ, and the tracking algorithm cannot be implemented. For simplicity, in those cases a default reference value of 0 degrees is used for θ_c, which makes the robot look straight ahead in the direction of the target, assuming it is not moving.

Tracking Algorithm Stability
The closed-loop dynamics in Figure 7 can be written in terms of the PD feedback, where K and D are the PD controller's proportional and derivative gains, respectively, and J is the polar moment of inertia of the chaser. Notice that the properties of the chaser are assumed to be known. The stability analysis is carried out around an equilibrium where the chaser is looking directly at the target. For simplicity, it is also assumed that the chaser and the target share the same x_I coordinate.
In other words, (θ_c)_eq = 0. At this equilibrium, Equation (42) yields

J θ̈_c + D θ̇_c + K θ_c = 0.

It can be seen that for K, D > 0 (and J > 0, because it is a moment of inertia), Equation (44) represents an asymptotically stable system.
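The stability condition can be checked numerically from the characteristic polynomial of the second-order closed loop sketched above (the gains below are illustrative):

```python
import numpy as np

def closed_loop_poles(J, K, D):
    """Poles of J*theta'' + D*theta' + K*theta = 0, the closed-loop
    dynamics at the equilibrium (sketch)."""
    return np.roots([J, D, K])

def is_asymptotically_stable(J, K, D):
    """Both poles must lie strictly in the open left half-plane."""
    return bool(np.all(np.real(closed_loop_poles(J, K, D)) < 0))
```

For any K, D > 0 and J > 0 the Routh-Hurwitz condition for a second-order polynomial holds, so both poles have negative real parts, in agreement with the analysis.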
This system is also asymptotically stable in a more general case, where the chaser is looking directly at the target but the LOS vector is not aligned with the inertial coordinate system I, that is, (θ_c)_eq ≠ 0. In this case, in order to prove stability, it is required to define a new inertial coordinate frame, rotated by (θ_c)_eq with respect to I. In the rotated inertial frame, Equation (43) is satisfied and the stability proof holds.

Numerical Study
To initialize the filters (Section 3.4) and the distance value for each feature point (Section 2.4.3), it is required to solve Equation (6). In this section, a numerical evaluation will compare the solutions of Equation (6) using the LS, TLS and CVE approaches.

Static Problem Description
Consider a static object, which is viewed by a camera rig with either 2 or 4 cameras. In the case of 2 cameras, the cameras are vertically aligned and horizontally positioned with a baseline of length b. In the case of 4 cameras, the cameras are located at the vertices of a square with edges of length b.
Each camera acquires measurements of the viewed object according to Equation (5). Each measurement is corrupted with a zero mean Gaussian noise with a standard deviation ∆Z and is rounded to the closest integer (because the measurements are in pixels). The noisy measurements are then used for writing Equation (6), which is solved using LS and TLS. In the case of 2 cameras, CVE is also used.
This routine is carried out in 5000 Monte-Carlo runs, which produce estimations for the relative position of the object relative to the camera rig. RANSAC [20] was used in order to detect and reject outliers. Consequently, approximately 4% of the estimations were considered outliers and were discarded. The parameter values used in this numerical estimation are summarized in Table 1.
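The Monte-Carlo routine can be sketched generically; the `project` and `solve` callables stand in for Equations (5) and (6) and are assumptions of this sketch:

```python
import numpy as np

def monte_carlo_cms(project, solve, rho_true, n_runs=5000, sigma_px=1.0, seed=0):
    """Monte-Carlo evaluation sketch of a converted-measurements solver.
    `project` maps a relative position to ideal pixel measurements and
    `solve` maps noisy, integer-rounded pixels back to a position estimate;
    both are caller-supplied stand-ins for Equations (5) and (6)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_runs):
        z = project(rho_true)
        # additive Gaussian pixel noise, then rounding (measurements are in pixels)
        z_noisy = np.rint(z + rng.normal(0.0, sigma_px, size=z.shape))
        estimates.append(solve(z_noisy))
    est = np.array(estimates)
    return est.mean(axis=0) - rho_true, est.std(axis=0)   # bias, dispersion
```

Running this with the LS, TLS and CVE solvers in turn yields the per-method bias and dispersion values compared in Tables 2 and 3.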


Results
Tables 2 and 3 provide the statistical properties of the histograms depicted in Figures 8 and 9, respectively. µ and σ denote the mean value and standard deviation, respectively. (σ_LS)_Est is the estimate of the standard deviations of ρ_x, ρ_y, ρ_z in the case of 2 cameras, computed using Equation (18).
Although the estimation bias is not negligible with LS, TLS and CVE, with either 2 or 4 cameras, TLS and CVE produce significantly less bias than LS. Also, the use of 4 cameras does not always yield less bias than 2 cameras. In the y direction, TLS and CVE produce slightly greater dispersion than LS. As expected, the use of 4 cameras yields less dispersion than 2 cameras. (σ_LS)_Est is a fair approximation of σ_LS, because their corresponding components share the same order of magnitude.
According to these results, for the case of 2 cameras, LS, TLS, and CVE have similar performance. In this research TLS is used. For the case of 4 cameras, TLS is preferable over LS because of the bias improvement, while the dispersion is only slightly greater.

Experimental Results
The algorithms developed herein were tested in a series of experiments. These algorithms include image acquisition software and control software, which achieve real-time performance. As mentioned in Section 1, the experimental part uses 2 cameras.
The sampling frequency of the computer-vision software is mainly dependent on the number of discovered feature points, and the size of the target projection in the frames. As a result, the computer-vision software's time step varies between 0.4 and 0.6 s. The control software operates at a constant sampling frequency of 30 Hz.
The initial state covariance for all the filters is a 6 × 6 diagonal matrix whose main diagonal is [4 m^2, 4 m^2, 4 m^2, 225 m^2/s^2, 225 m^2/s^2, 225 m^2/s^2]. The values of [σ_x, σ_y, σ_z] for the process noise covariance, as mentioned in Section 3.1, are [1, 1, 0] m/s^2, [0.5, 0.1, 0] m/s^2 and [0.8, 0.8, 0] m/s^2 for the EKF, the UKF and the CMKF, respectively. Because only 2 cameras are used, the measurement noise covariance for the EKF and UKF is a 4 × 4 diagonal matrix, whose main diagonal is [25 px^2, 100 px^2, 25 px^2, 100 px^2]. For the CMKF, the measurement noise covariance is calculated at each step, as described in Section 3.3; for that, ΔZ is assumed to be 10 pixels. Also, the UKF algorithm in this research uses a simple uniform set of weights [13].

Laboratory Description
The experimental part of this research was conducted in the DSSL at the Technion. The DSSL is equipped with an air table, floating robots (Figure 10a), communication system, main computer, a stereovision camera rig ( Figure 10b) and an overhead camera (Figure 10c), which measures the position and orientation of the robots on the air table.
(a) Three robots placed on the air table.
(b) Chaser equipped with a stereovision system.
(c) Image taken by the overhead camera's software. Each robot is recognized in the scene, including its orientation and its velocity direction.

Each robot is characterized by four circles of different colors in a different order (Figure 10c). The overhead camera uses those colors to identify each robot and determine its orientation. The overhead camera is calibrated so as to relate the position of each robot in the image plane to its corresponding inertial position. Figure 11 depicts the chaser with the camera rig facing the target. As can be seen in Figure 11, the camera rig contains four cameras, which are all aligned. Notice that although the mounted camera rig includes four cameras, the experimental part of the research utilizes only two of them (the top two, marked as "Cam1" and "Cam2" in Figure 11).
Figure 11. The body frame, camera numbers and LOS vector, shown in the laboratory environment.

Reference Signals
In the experiments, the values ρ_x^C, ρ_y^C, φ, θ_c are estimated and compared to reference values, which are calculated in the following manner. The overhead camera (see Section 6.1) acquires measurements of the positions of the chaser and the target in the inertial frame, x_c^I, x_t^I, and the orientation of the chaser, θ_c. ρ_x^C and ρ_y^C are then obtained by expressing the relative position x_t^I - x_c^I in the chaser's body frame. The LOS azimuth angle φ is calculated using Equation (41).

FOV Properties
The parameter values used in this research are given in Table 4. Using Equations (37), (38) and (40) and Table 4, StereoView and Range_min are calculated.

Static Experiment
A static experiment was conducted, wherein the target and chaser were placed at fixed locations. The purpose of this experiment was to test the target recognition algorithm. Due to measurement noise, which manifests as a different number and location of feature points in each frame, the target's COP varies slightly from frame to frame, as seen in Figure 12. It is desired to measure this dispersion in order to use it in the estimation algorithm. In this experiment, due to the prominence of the color of the target's bottom compared to its top, a larger number of features was found at the bottom. Consequently, the COP location was estimated to be lower than expected. As seen in Equation (15), this has little to no effect on the x and y components of the LOS estimation; the main influence on these components is the COP's horizontal position.

As noted in Figure 13, the dispersions of the target's COP in both directions share the same magnitude; still, it is reasonable to assume that σy is greater than σx due to the target's shape, which is characterized by a slightly longer vertical dimension.

Figures 13 and 14 depict the LOS horizontal components and the azimuth estimation, respectively. It can be seen that the estimates of ρx and ρy, and consequently of φ, are biased, and that all of the estimated values are consistently larger than the true values. It can also be seen that in the static estimation scenario, due to the slight changes in the target's COP at each time step, the estimates of the filters do not converge, but stay around the average value. From these results it is not definite which estimator works best; the use of the CMKF did not produce better results than the EKF and UKF.

Figure 13. A comparison of ρx (top) and ρy (bottom) estimation using different filters in the static experiment.

Figure 14. A comparison of the azimuth estimation using different filters in the static experiment.
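The per-axis COP dispersion and the azimuth computation used in this analysis can be sketched as follows. This is a minimal illustration: the COP samples below are synthetic stand-ins (the real values come from the feature detector), and the helper names are not from the original implementation.

```python
import numpy as np

def cop_dispersion(cop_samples):
    """Sample standard deviation of the target's center-of-pixels
    (COP) across frames, per image axis (horizontal, vertical)."""
    cop = np.asarray(cop_samples, dtype=float)
    return cop.std(axis=0, ddof=1)

def azimuth(rho_x, rho_y):
    """Azimuth angle from the LOS components, phi = arctan(rho_x / rho_y)."""
    return np.arctan2(rho_x, rho_y)

# Synthetic COP samples (pixels): slightly more vertical spread,
# mimicking the target's longer vertical dimension.
rng = np.random.default_rng(0)
samples = np.column_stack([320 + rng.normal(0.0, 1.0, 200),
                           260 + rng.normal(0.0, 1.5, 200)])
sigma_x, sigma_y = cop_dispersion(samples)
print(sigma_x, sigma_y)  # vertical dispersion comes out larger here
```

With real data, these per-axis standard deviations would feed the measurement noise covariance of the filters.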

Semi-Static Experiment
A semi-static experiment was conducted, wherein the target was stationary and the chaser had to maintain its position and orientation in order to keep the target within its FOV, while dealing with dynamic modeling inaccuracies and external disturbances.
Three impulses of external torque were applied during this experiment; they are marked in the figures. Figure 15 depicts the chaser's body angle θc. It can be seen that after each impulse, θc converges to its nominal value with the characteristic response of a second-order system with two complex stable poles. This is expected, based on the rotational closed-loop dynamics modeled in Section 4.2. Between 116 and 120 s, the target left the chaser's FOV; as a result, θc's desired value was set to zero. While the chaser rotated towards θc = 0, the target returned to the FOV, the target recognition algorithm detected it, and tracking was resumed.

Figure 16 depicts the LOS components estimation as well as the reference values, where the computation of the reference values is described in Section 6.2. The most prominent property in these graphs is the bias in the ρy estimation. Most of the time this bias has a constant mean value, with oscillations and occasional peaks. There is also a bias in the ρx estimation, but unlike the ρy bias, it has negligible oscillations. A good correspondence between the ρx estimation and its reference is also notable. The absence of the target from the chaser's FOV at approximately 116-120 s is visible, and shortly after this event the estimators diverge. In such events, the control algorithm is programmed to ignore the unreliable current estimate and to calculate the control signals according to a nominal predefined state. This divergence ended when the target re-entered the FOV and tracking was resumed.

Figure 17 depicts the azimuth angle φ and its reference. φ is calculated using ρx and ρy, and therefore all the phenomena that occurred in ρx and ρy reappear in φ as well. It is also notable that the φ estimation is characterized by bias and delay. At t = 146 s, it can be seen that the CMKF and EKF estimates had a large error compared to the UKF estimate.
It is important to note that in this experiment the control signals were calculated using the UKF output. Figure 18 depicts the torque commands versus the applied torque. As seen, the thrusters provide a discrete control torque, although the commanded value is continuous.
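The gap between the continuous torque command and the discrete applied torque can be illustrated with a minimal on-off modulation sketch. The deadband and the fixed thruster torque level below are illustrative assumptions, not the values used on the experimental platform.

```python
def applied_torque(command, thruster_torque=0.1, deadband=0.02):
    """Map a continuous torque command (N*m) to the discrete torque
    a pair of on-off thrusters can apply: -T, 0, or +T."""
    if command > deadband:
        return thruster_torque
    if command < -deadband:
        return -thruster_torque
    return 0.0

commands = [0.005, 0.05, -0.08, -0.01]
print([applied_torque(u) for u in commands])  # [0.0, 0.1, -0.1, 0.0]
```

A deadband of this kind is one common way to avoid thruster chatter around zero command; the actual modulation scheme on the chaser may differ.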

Dynamical Experiment
A dynamical experiment was conducted, wherein the target moved in a cyclic trajectory. The chaser's objective was to maintain its position and change its orientation in order to keep the target within its FOV. Figure 19 depicts the trajectories of the chaser and the target during the experiment. In order to better illustrate the experiment, four points in time were selected; for each, the position and orientation of the chaser and the position of the target are marked.

Figure 20 depicts the LOS components estimation and references. A bias is seen in the ρy estimation, and it is clearly seen that the EKF has the worst performance. On the other hand, the ρx estimation corresponds well to its reference with all estimators. Figure 21 depicts the azimuth angle φ during the experiment. Although ρy was consistently over-estimated, the azimuth angle corresponds well to its reference, without bias.
During approximately 190-194 s and 227-231 s, a large error is seen in the azimuth angle. These large errors likely occurred due to rapid and substantial changes in the relative position; it is important to remember that the estimation algorithm has a delay of approximately half a second.

Figure 22 depicts the torque commands versus the applied torque. As seen, the control signal oscillates, which implies that the dynamical system does not reach an equilibrium. This happens because, as long as the target continues moving, the chaser has to keep adjusting its orientation.

Figure 22. A comparison between the continuous control torque signal, the discrete applied torque and the PD signal.
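A back-of-the-envelope estimate of the delay-induced error: with a pure estimation delay τ, a target whose azimuth changes at rate φ̇ produces a transient tracking error of roughly φ̇·τ. The numbers below are illustrative, not measured values from the experiment.

```python
def delay_error_deg(phi_rate_deg_s, delay_s):
    """First-order approximation of the azimuth error caused by a
    pure estimation delay: delta_phi ≈ phi_rate * delay."""
    return phi_rate_deg_s * delay_s

# E.g. a rapid maneuver changing the azimuth at 10 deg/s, seen
# through the ~0.5 s estimation delay quoted in the text:
print(delay_error_deg(10.0, 0.5))  # 5.0 deg of transient error
```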

Error Analysis
The results in Sections 6.4-6.6 exhibit a bias in the estimation of the LOS depth component. It seems that for stereovision with two cameras, bias is an inherent property of the system; this bias was also seen in Figure 8 and in previous work [10].
An interesting issue worth discussing is the fact that although the LOS depth component ρy is biased, the estimated azimuth angle φ is quite accurate. Recall that

φ = arctan(ρx/ρy),

and hence

∆φ = (ρy ∆ρx − ρx ∆ρy)/(ρx² + ρy²) ≈ ∆ρx/ρy − (ρx/ρy²) ∆ρy.  (53)

Equation (53) shows that the errors ∆ρx, ∆ρy contribute to the error ∆φ, but they are mitigated by ρy and ρy², respectively. Since ρy has a relatively large value, it reduces the error effect dramatically.
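This mitigation can be checked numerically with the first-order error propagation for φ = arctan(ρx/ρy); the LOS values and error magnitudes below are illustrative, not data from the experiment.

```python
import math

def azimuth_error(rho_x, rho_y, d_rho_x, d_rho_y):
    """First-order error in phi = arctan(rho_x / rho_y):
    d_phi = (rho_y*d_rho_x - rho_x*d_rho_y) / (rho_x**2 + rho_y**2)."""
    return (rho_y * d_rho_x - rho_x * d_rho_y) / (rho_x**2 + rho_y**2)

# A large depth bias (0.5 m) at rho_y = 10 m barely moves phi:
d_phi = azimuth_error(rho_x=1.0, rho_y=10.0, d_rho_x=0.02, d_rho_y=0.5)
print(math.degrees(d_phi))  # a fraction of a degree
```

The example shows why a sizable depth bias can coexist with an accurate azimuth estimate: the depth error enters ∆φ scaled by ρx/ρy².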

Conclusions
In this research, a stereovision-based algorithm for the detection of non-cooperative targets and the real-time estimation of the relative position was developed. The detection algorithm does not require any a priori information, except the assumption that the target is the most dominant object in the image. Moreover, a real-time tracking algorithm, which utilizes the stereovision information and keeps the target within the FOV, was developed.
A numerical study of solution methods for the measurement equations was performed, and experiments utilizing the different algorithms were conducted. In these experiments, the performance of three estimators was compared using experimental data.
Real-time performance of the estimation and LOS control algorithms was demonstrated in the experiments carried out at the DSSL. The experimental and numerical results show that there is a non-negligible bias in the estimation of the LOS depth component with all of the estimators, whereas the LOS horizontal component is estimated with a much smaller bias. The semi-static experiment exhibited the target detection algorithm's performance in the case where the target leaves and returns to the FOV.
Although the three estimators exhibit similar behaviour, it seems that the EKF has the poorest performance. Deciding which estimator is better, the UKF or the CMKF, is not trivial, because most of the time their performance is similar. However, the semi-static experiment shows that, unlike the CMKF, the UKF manages to cope with rapid and substantial changes in the dynamics, and it therefore outperforms the CMKF.
By using SURF feature points, simplifying the detection algorithm and using a small state vector, the average time-step duration was reduced, compared to previous work, to approximately 0.4 s, which for most applications is approximately the upper bound for real-time operation.