Article

A Systematic Solution for Moving-Target Detection and Tracking While Only Using a Monocular Camera

1 Guangdong Provincial Key Lab of Robotics and Intelligent System, Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS), Shenzhen 518055, China
2 Shandong Institute of Advanced Technology, CAS, Jinan 250102, China
3 School of Control Science and Engineering, Shandong University, Jinan 250061, China
4 School of Microelectronics, Southern University of Science and Technology, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
Sensors 2023, 23(10), 4862; https://doi.org/10.3390/s23104862
Submission received: 23 April 2023 / Revised: 12 May 2023 / Accepted: 16 May 2023 / Published: 18 May 2023
(This article belongs to the Special Issue Image Processing and Analysis for Object Detection)

Abstract

This paper focuses on moving-target detection and tracking in a three-dimensional (3D) space, and proposes a visual target tracking system only using a two-dimensional (2D) camera. To quickly detect moving targets, an improved optical flow method with detailed modifications in the pyramid, warping, and cost volume network (PWC-Net) is applied. Meanwhile, a clustering algorithm is used to accurately extract the moving target from a noisy background. Then, the target position is estimated using a proposed geometrical pinhole imaging algorithm and cubature Kalman filter (CKF). Specifically, the camera’s installation position and inner parameters are applied to calculate the azimuth, elevation angles, and depth of the target while only using 2D measurements. The proposed geometrical solution has a simple structure and fast computational speed. Different simulations and experiments verify the effectiveness of the proposed method.

1. Introduction

Moving-target detection and tracking is widely required in various applications, such as security systems, obstacle detection, and search-and-rescue missions [1,2,3,4]. Consequently, this research topic has attracted much attention. Previous studies on moving-target detection and tracking can be divided into different methods according to the applied sensors, e.g., lidar-, sonar-, somatosensory-, and visual-based methods [5,6]. The lidar-based method is mainly used indoors, where a certain number of reflective plates must be installed in the environment; the requirements on installation accuracy and placement are demanding, so the cost is high. The sonar-based method is usually used in underwater scenarios. The somatosensory-based method requires special sensors to be carried by the targets, which is often infeasible, especially in security applications. Compared with the above methods, the visual-based method is the most popular in civilian applications. Owing to the rapid development of camera sensors, the visual-based method has many advantages, e.g., low cost, a high signal-to-noise ratio, convenient transmission, and fast operation speed [7,8]. Therefore, it is an appropriate choice for moving-target detection and tracking, even for microtargets [9,10,11].
In computer vision, moving-target detection has two steps, i.e., removing redundant information and extracting the region with positional changes in the image sequence. In the 1970s, Jain et al. proposed a method to detect moving targets by using scene changes between adjacent frames, called the frame difference method [12]. The main advantage of the frame difference method is its low computational complexity. Thus, the algorithm is simple to implement and can adapt to different dynamic environments with various lighting conditions. However, it cannot detect all the pixels of a moving target [13], and the accuracy of the detection result is limited [14]. In [15], the traditional frame difference method was improved with a three-frame difference method to calculate the difference. A moving-vehicle video detection method that combines ViBe and frame difference was proposed in [8]; it improves the background update efficiency and can detect moving vehicles in video. The background difference method developed in [16] has low complexity; however, it has the "tailing" problem and requires constant illumination. In [17], the difficulties of the background difference method, i.e., background modeling and the real-time updating of subsequent models, were overcome. The authors of [18] proposed a moving-object detection method through a brief background subtraction technique. In [19], an intelligent background differential model was designed for training target monitoring, and a background difference method based on RGB color separation was proposed.
The optical flow technique is an effective and widely applied method. In 1981, Horn and Schunck first connected an image's grayscale with the optical flow vector and proposed the basic model of optical flow calculation [20]. Subsequently, on the basis of this model, various methods for optical flow calculation have been developed. The Lucas–Kanade optical flow method is a widely used differential algorithm for optical flow calculation. However, it needs to satisfy three assumptions simultaneously, and these strong constraints are challenging to meet [21]. The pyramid optical flow method was created to compensate for the limitations of the Lucas–Kanade method [22]. With the development and successful application of deep-learning technology to images [23], many computing methods have been developed by introducing deep learning into the field of optical flow estimation. In 2016, SpyNet was able to estimate the motion between two images by using only a small network with the help of a spatial pyramid and image warping [24]. PWC-Net was proposed in 2018; it constructs an image feature pyramid on the basis of the traditional pyramid optical flow method while significantly reducing the number of network parameters, to about 0.06 times that of FlowNet2.0 [25]. In 2018, the lightweight optical flow estimation network LiteFlowNet was proposed [26]. MaskFlowNet was proposed in 2020 [27]; compared with PWC-Net, it effectively overcomes the ghosting problem and improves detection accuracy. Some solutions using artificial intelligence algorithms were also proposed at the expense of computational complexity [28,29].
According to the number and characteristics of the visual sensors, existing localization systems can be roughly divided into multi-view, binocular, and monocular vision localization [30]. Multi-view vision can obtain more visual information, but the structure of the visual controller is complex [31], and real-time performance is poor. Binocular vision needs to know the spatial position relationship between the two cameras accurately, collect two images from different perspectives, and then calculate the 3D coordinates of a target from the pixel difference between the two images through the triangulation principle; the stereo matching involved is complex [32]. Monocular vision localization only needs to estimate the target orientation and distance information through the inherent parameters of the camera, which is low-cost and easy to implement. The system structure is relatively simple and has low requirements for computing resources, so it has been widely used [33]. On the basis of monocular vision, a target azimuth estimation algorithm for range-gated imaging was proposed in [34], representing the target azimuth through the projection relationship between coordinate systems and trigonometric functions. A systematic procedure of model-referenced pose estimation was introduced in [35] to obtain the relative pose information between an underwater vehicle and underwater structures whose geometry and shape are known. The authors in [36] proposed an iterative algorithm that could estimate a target's pose from an image according to the size of the target. On the basis of the geometric model of camera imaging and the camera's parameters, the analog imaging signal could be transformed into a digital one.
Conventional moving-target tracking based on visual image information focuses on the 2D image plane and cannot operate in a 3D space. 2D tracking has certain limitations: it can only obtain the motion information of the moving target along two directions, whereas the 3D tracking of moving targets is fundamental in industrial control. In recent years, many visual devices have become available for obtaining the 3D spatial information of moving targets, such as the RealSense depth camera, Kinect camera, and Leap Motion somatosensory device [37]. However, such devices are usually expensive, have various limitations, and are mostly used in academic research [38]. Obtaining 3D information at a low cost using computer vision and an ordinary optical camera is therefore of great significance. The Kalman filter (KF) is the most commonly used tool for the 3D tracking of moving targets. For nonlinear motion, researchers improved the KF and proposed many suitable methods, such as the extended Kalman filter (EKF), unscented Kalman filter (UKF), quadrature Kalman filter (QKF), and cubature Kalman filter (CKF) [39,40,41].
To summarize, there are some solutions for moving-target detection and tracking based on visual images, but the limitations of their complexity and cost have not been fully addressed. Specifically, some key problems are listed below. (1) Noise seriously affects the detection of moving targets. (2) Most previous papers about visual-based target detection used a 3D camera, which has the disadvantages of inaccurate depth estimation and complex computation, whereas previous methods using only a monocular camera could hardly satisfy practical requirements. (3) Systematic solutions, including improvements in both target detection and estimation, have not been fully addressed. This paper achieves significant improvements over our previous work [42]: a systematic method is proposed for moving-target detection and tracking. The three key contributions of this paper are presented below.
  • An improved optical flow method with modifications in the PWC-Net is applied, and the K-means and aggregation clustering algorithms are improved. Thus, a moving target could be quickly detected and accurately extracted from a noisy background.
  • A geometrical solution using pinhole imaging theory and the CKF algorithm is proposed that has a simple structure and fast computational speed.
  • The proposed method was verified with sufficient simulation and experimental examples, which has significant value for practical applications.
This paper is organized as follows. Section 2 introduces some related previous methods and the problem formulation. The optical flow-based moving target detection and extraction method is developed in Section 3. Section 4 proposes the pinhole geometrical target state estimation algorithm in detail. A CKF algorithm is modified in Section 5 to further improve the estimated results from the geometrical solution. The experimental verification of the proposed method is provided in Section 6. Lastly, Section 7 gives the conclusion.

2. Preliminaries

2.1. Pyramid Lucas–Kanade Optical Flow Method

The classical optical flow method comprises two parts: obtaining the basic optical flow constraint equation and calculating the optical flow value. To acquire the basic constraint equation of optical flow, two assumptions should be satisfied: (i) the brightness of the same object in pictures taken at different time instants should be uniform, and (ii) the positional change of the moving target between adjacent frames is small. We define the central position of a target in an image as $(x, y)$, and its gray value at time $t$ is $I(x, y, t)$. The target central point moves to position $(x + dx, y + dy)$ in the next frame after $\Delta t$, and the gray value becomes $I(x + dx, y + dy, t + \Delta t)$. On the basis of the first assumption that the brightness is uniform, the following equation holds:
$$I(x, y, t) = I(x + dx, y + dy, t + \Delta t). \quad (1)$$
On the basis of the second assumption, we can take the Taylor expansion of (1), which has the following form:
$$I(x + dx, y + dy, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} \Delta t + \varepsilon, \quad (2)$$
where $\varepsilon$ is the second-order infinitesimal term that can be ignored since $\Delta t \to 0$. Therefore, we have
$$\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} \Delta t = 0. \quad (3)$$
We define $u, v$ as the velocity components of the optical flow along the coordinate axes $x$ and $y$, i.e., $u = \frac{dx}{\Delta t}$, $v = \frac{dy}{\Delta t}$, and $I_x = \frac{\partial I}{\partial x}$, $I_y = \frac{\partial I}{\partial y}$, $I_t = \frac{\partial I}{\partial t}$ are the partial derivatives of the gray value of the image pixel along the $x$, $y$, $t$ directions; thus, the following basic constraint equation of optical flow is obtained:
$$I_x u + I_y v + I_t = 0. \quad (4)$$
In order to obtain a unique $(u, v)$, many calculation methods can be applied. The most popular one is the Lucas–Kanade optical flow method. To use it, a third assumption should hold as an extra constraint on the optical flow: neighboring pixels have the same motion. Specifically, the projected points of neighboring points in the 3D space are also nearest neighbors in the image. On the basis of this assumption, the equations of neighborhood pixels can be stacked, and the optical flow of the regional central pixel can be obtained using the weighted least-squares method [42]. The Lucas–Kanade (LK) optical flow method therefore needs to meet three assumptions simultaneously, and the corresponding constraints should be satisfied; otherwise, the calculated optical flow has large errors, so the LK method cannot always be applied directly. The pyramid LK optical flow method is an appropriate solution [22]. In the pyramid LK optical flow method shown in Figure 1, we first need to establish the image pyramid, which includes image smoothing with a Gaussian filter and downsampling to reduce the image resolution. The image resolution of each layer is reduced to half of that of the layer below it. Thus, a large displacement motion is gradually reduced and decomposed into several small displacement motions for an accumulated optical flow calculation, which is suitable for high-speed motion. In the layer-by-layer iterative solution, the optical flow is first calculated with the LK method at the top of the image pyramid. Then, the optical flow calculated from the image of Layer L-1 is taken as the initial value of the optical flow estimation for the Layer L-2 image. This process is repeated until the optical flow at the bottom of the image pyramid is calculated. The pyramid operations can satisfy the three assumptions mentioned before, and the optical flow can be calculated accurately. For example, if the maximal pixel displacement that the LK optical flow method can process is $d_{\max}$, the maximal pixel displacement that the pyramid LK method can process becomes $d_{\max}^{final} = (2^{L+1} - 1)\, d_{\max}$. Thus, the optical flow calculation error for a large-displacement moving target is significantly reduced.
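As a concrete illustration of the pyramid LK procedure described above, the following minimal Python/OpenCV sketch tracks sparse corner features between two frames; the window size, pyramid depth, and file names are illustrative assumptions, not values from this paper.

import cv2
import numpy as np

# Illustrative file names; replace with two consecutive video frames.
prev_gray = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Detect corner features in the first frame to seed the tracker.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                             qualityLevel=0.01, minDistance=7)

# Pyramid LK: maxLevel=3 builds a 4-layer pyramid, so displacements up to
# (2^(L+1) - 1) * d_max can be handled, as discussed above.
p1, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, p0, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

# Keep only successfully tracked points and compute their flow vectors.
good_old = p0[status.flatten() == 1].reshape(-1, 2)
good_new = p1[status.flatten() == 1].reshape(-1, 2)
flow = good_new - good_old
print("mean displacement (pixels):", np.linalg.norm(flow, axis=1).mean())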

2.2. Target Localization Using Monocular Vision

Previous 3D spatial localization methods that only use a monocular camera are generally based on processing two images acquired from different positions. An example of locating one target using two images is shown in Figure 2. In order to explain the monocular visual–spatial localization method with two images, we drew a 2D plane schematic diagram, as shown in Figure 3. The positions of the camera's optical center before and after the translation are $L_1$ and $L_2$. With $L_1$ as the origin, the camera optical axis along the $Z$ axis, and the direction perpendicular to the camera optical axis as the $X$ axis, a 2D rectangular coordinate system is established. $L_1$ and $L_2$ are located on the same coordinate axis, and the distance between them is $b$. When the camera's optical center is at position $L_1$, the projected position of point $P$ on the imaging plane is $P_1$; when the camera's optical center is at position $L_2$, the projected position of point $P$ on the imaging plane is $P_2$. The other geometrical details are shown in Figure 3.
On the basis of Figure 3, the following proportional relationships exist in similar triangles:
$$\frac{X_1}{X} = \frac{f}{Z}, \quad (5)$$
$$\frac{X_2}{X - b} = \frac{f}{Z}. \quad (6)$$
From Equations (5) and (6), we have
$$Z = \frac{b f}{X_1 - X_2}. \quad (7)$$
Thus, the target's depth is obtained from the information of the target in the image, the camera's translation distance, and the camera's focal length. When the depth is known, the other two coordinates of the target can be obtained as follows:
$$X = \frac{b X_1}{X_1 - X_2}, \quad (8)$$
$$Y = \frac{b y_1}{X_1 - X_2}. \quad (9)$$
$X_1 - X_2$ is called the parallax, and $b$ is called the baseline.
The above method is commonly used for monocular vision 3D localization. Because the horizontal movement of the camera keeps the coordinates along the $y$ axis constant, the rectification step is omitted, and the image feature points are matched directly. By slightly moving the monocular camera, the principle of 3D target localization becomes similar to that of a binocular camera. However, there are also some limitations. First, the monocular camera must be moved to obtain images from two places. Second, the computed target coordinates $(X, Y, Z)$ are affected by the parallax $X_1 - X_2$; therefore, calculation errors accumulate. The limitations of directly using a binocular camera were introduced in the previous section.
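The relations (5)–(9) translate directly into code. The following short Python sketch recovers the 3D coordinates from the two image x-positions; the numerical values of the baseline, focal length, and image coordinates are made-up examples.

def locate_from_two_views(x1, x2, y1, b, f):
    """Recover (X, Y, Z) from two horizontally displaced views, Eqs. (5)-(9).

    x1, x2: image x-coordinates of the target in view 1 and view 2
    y1:     image y-coordinate of the target in view 1
    b:      baseline (camera translation between the two views)
    f:      focal length, in the same units as x1, x2, y1
    """
    parallax = x1 - x2
    Z = b * f / parallax          # Eq. (7): depth from parallax
    X = b * x1 / parallax         # Eq. (8)
    Y = b * y1 / parallax         # Eq. (9)
    return X, Y, Z

# Made-up example values (b in meters; f and image coordinates in millimeters).
print(locate_from_two_views(x1=2.4, x2=1.8, y1=1.1, b=0.10, f=8.0))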

2.3. Nonlinear Filter

The extended Kalman filter (EKF) is a classical nonlinear estimator that is widely applied to target tracking [43]. The EKF has three main steps, i.e., state prediction, Kalman gain calculation, and state correction. Its algorithmic structure is similar to that of the conventional linear KF. Since the linearization in the EKF only uses the first-order term of the Taylor expansion, the EKF's estimation performance for nonlinear target tracking is limited. Although the EKF extends the KF to nonlinear applications, it still has many disadvantages: the linearization error can hardly be avoided, and divergence often occurs. An improved nonlinear filter, the cubature Kalman filter (CKF) [44], is theoretically one of the closest approximations to the Bayesian filter. After strict mathematical derivation, the estimation accuracy and convergence of the CKF are guaranteed in theory. The CKF borrows the idea of the particle filter: many particles (cubature points) with identical weights are selected for propagation through the nonlinear functions according to the cubature criterion and the prior probability density distribution. Therefore, the calculation cost is low, there is no need to linearize the nonlinear functions, the linearization error is eliminated, and the divergence problem of the EKF is also resolved.

3. Moving-Target Detection and Extraction

Moving-target tracking based on the monocular camera first needs to accurately capture a moving target in a video image. Therefore, this section proposes a moving-target detection method based on an improved PWC-Net, and a moving-target extraction method that is a combination of the improved K-means aggregation clustering algorithm, frame difference method, and morphological operation. The diagram of the proposed detection and extraction method is shown in Figure 4.

3.1. Video Image Preprocessing

3.1.1. Grayscale Processing

Grayscale processing converts color images into grayscale images. A grayscale image can enhance and highlight image features. Because the expression of grayscale image information is simple, it can reflect local and overall chromaticity, brightness, and other scene information with different digital values. This paper uses the weighted average method to complete the conversion: different weights are allocated to the three primary colors, i.e., red, green, and blue. Since the human eye is most sensitive to green and red, the weights for grayscale processing are usually set to special values [45,46]. In this paper, we define them as follows:
$$g(i, j) = 0.3 R(i, j) + 0.59 G(i, j) + 0.11 B(i, j). \quad (10)$$
Two groups of video images were collected. The two processed video images represent the different motions of different moving targets and motion states in different indoor and outdoor scenes. As shown in Figure 5, a basketball is falling and a target person is running, and the processed figures are applied for further target object extraction.
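A minimal NumPy sketch of the weighted grayscale conversion in Equation (10); note that OpenCV loads images in BGR channel order, and the file names are placeholders.

import cv2
import numpy as np

bgr = cv2.imread("frame.png")               # placeholder file name; OpenCV uses BGR order
B, G, R = cv2.split(bgr.astype(np.float32))

# Eq. (10): weighted average emphasizing the green and red channels.
gray = 0.3 * R + 0.59 * G + 0.11 * B
gray = np.clip(gray, 0, 255).astype(np.uint8)
cv2.imwrite("frame_gray.png", gray)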

3.1.2. Wavelet Transform Threshold for Noise Elimination

Noise remains in an image after grayscale processing, which impacts the final results. Since the effective signal and the noise have different characteristics after orthogonal wavelet decomposition, a wavelet transform threshold is applied to eliminate the noise. The purpose of this method is to find a suitable threshold that retains the appropriate wavelet coefficients. The applied wavelet transform threshold method has three steps. We define the actual signal as
$$q(t) = s(t) + e(t), \quad (11)$$
where $s(t)$ is the effective signal, and $e(t)$ is the noise.
The first step is to select the wavelet basis. We selected the Haar wavelet and then performed the orthogonal wavelet transform on the measured signal:
$$WT_q(a, b) = WT_s(a, b) + WT_e(a, b). \quad (12)$$
The second step is to select the threshold and the thresholding function that determine the retained wavelet coefficients. We selected the soft thresholding function
$$f_\lambda^s(w) = \begin{cases} \mathrm{sgn}(w)(|w| - \lambda), & |w| \ge \lambda \\ 0, & |w| < \lambda \end{cases} \quad (13)$$
where $w$ is a wavelet coefficient, and $\lambda$ is the selected threshold. To set an appropriate threshold, we used
$$\lambda = \sigma \sqrt{2 \ln N}, \quad (14)$$
where $N$ is the total number of wavelet coefficients of the actual signal after wavelet decomposition, and $\sigma$ is the standard deviation of the noise signal.
The third step is reconstructing the signal according to the low-frequency and high-frequency coefficients after wavelet decomposition. The process of the wavelet transform threshold method is shown in Figure 6. Then, the images in Figure 5 were processed using the wavelet transform threshold method, and the results are shown in Figure 7, indicating improved performance.
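A compact sketch of the three-step procedure using the PyWavelets library (Haar basis, soft thresholding with the universal threshold of Equation (14)); the decomposition level and the median-based noise estimate are common choices rather than values taken from this paper.

import numpy as np
import pywt

def wavelet_denoise(gray, wavelet="haar", level=2):
    # Step 1: orthogonal wavelet decomposition of the image.
    coeffs = pywt.wavedec2(gray.astype(np.float32), wavelet, level=level)

    # Estimate the noise standard deviation from the finest diagonal detail
    # band (a common robust estimate; an assumption, not taken from the paper).
    sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
    n_coeffs = sum(c.size for band in coeffs[1:] for c in band)
    lam = sigma * np.sqrt(2.0 * np.log(n_coeffs))   # Eq. (14)

    # Step 2: soft-threshold every detail sub-band, Eq. (13).
    denoised = [coeffs[0]]
    for band in coeffs[1:]:
        denoised.append(tuple(pywt.threshold(c, lam, mode="soft") for c in band))

    # Step 3: reconstruct the image from the processed coefficients.
    return pywt.waverec2(denoised, wavelet)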

3.1.3. Contrast-Limited Adaptive Histogram Equalization (CLAHE)

To further reduce the complexity of object extraction in the later optical flow step, we needed to enhance the image contrast. Since the images had already been processed in the previous two steps, i.e., grayscale processing and noise elimination, we directly utilized the CLAHE method [47] to improve them. The CLAHE algorithm restricts the histogram of each sub-block region to an appropriate range to avoid overamplifying the noise. Referring to [48], the images processed by the CLAHE algorithm are shown in Figure 8.
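For reference, a minimal OpenCV sketch of CLAHE applied to the denoised grayscale image; the clip limit and tile grid size are typical defaults, not parameters reported in this paper.

import cv2

gray = cv2.imread("frame_denoised.png", cv2.IMREAD_GRAYSCALE)   # placeholder file

# CLAHE: equalize each tile's histogram while clipping it to limit noise amplification.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
cv2.imwrite("frame_clahe.png", enhanced)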

3.1.4. Optical Flow Estimation by Improved PWC-Net

With the pre-processed images, we next needed to identify a moving object by using the optical flow method before accurately extracting the target object. PWC-Net is a deep-learning optical flow estimation network proposed in 2018 [25]. However, PWC-Net suffers from a warping problem that results in doubled images, white space, ambiguity, and invalid information, as shown in Figure 9. To resolve the warping problem, we needed to detect the areas exhibiting these phenomena and eliminate them accurately. The asymmetric occlusion-aware feature matching module (AsymOFMM), which can learn occlusion masks, was proposed in 2020 [27]. The AsymOFMM can predict an occluded area and filter out useless information generated by warping without additional supervision and with almost negligible calculation cost. AsymOFMM's overall structure on one layer of the feature pyramid in the improved PWC-Net is shown in Figure 10.
The parameter training strategy of the improved PWC-Net was similar to that of PWC-Net. FlyingChairs was the basic training set, the initial learning rate was 0.0001, and the batch size was 8 for 1.2 M iterations, with the learning rate halved when the number of iterations reached 0.4, 0.6, 0.8, and 1 M. The network was then fine-tuned on the FlyingThings3D dataset with an initial learning rate of 0.00001 and a batch size of 4 for 0.5 M iterations, with the learning rate halved when the number of iterations reached 0.2, 0.3, and 0.4 M. The loss function adopted the standard error measure in optical flow estimation, the end-point error (EPE), calculated as follows:
$$EPE = \sqrt{(u - u_g)^2 + (v - v_g)^2}, \quad (15)$$
where $u$ and $v$ are the components of each pixel's predicted optical flow in the transverse and longitudinal directions, and $u_g$ and $v_g$ are the components of the ground-truth optical flow in the label in the transverse and longitudinal directions.
As shown in Table 1, through extensive experiments on the Sintel, KITTI 2012, and KITTI 2015 datasets, the improved PWC-Net based on the AsymOFMM achieved substantial improvements on all datasets. In the table, AEPE refers to the average EPE over all effective pixels in the image, and Fl-all refers to the percentage of outlier optical flow values among all effective pixels in the image. The AEPE is calculated as follows:
$$AEPE = \frac{\sum \sqrt{(u - u_g)^2 + (v - v_g)^2}}{mn}, \quad (16)$$
where $m$ and $n$ are the numbers of pixels in the horizontal and vertical directions of the image, respectively.
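A small NumPy sketch of Equations (15) and (16) for dense flow fields stored as H x W x 2 arrays; the array and function names are illustrative.

import numpy as np

def epe_map(flow_pred, flow_gt):
    """Per-pixel end-point error, Eq. (15). Inputs are H x W x 2 arrays (u, v)."""
    diff = flow_pred - flow_gt
    return np.sqrt(diff[..., 0] ** 2 + diff[..., 1] ** 2)

def aepe(flow_pred, flow_gt, valid_mask=None):
    """Average end-point error over valid pixels, Eq. (16)."""
    err = epe_map(flow_pred, flow_gt)
    if valid_mask is not None:
        err = err[valid_mask]
    return err.mean()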
We directly used PWC-Net to process the images in Figure 8 and acquired the results in Figure 11.
The improved PWC-Net was used for optical flow prediction for two frame images. The two different visualization results are shown in Figure 12. Next, we needed to extract the moving objects for the processed optical flow images.

3.2. Target Extraction

Next, we needed to accurately locate the position of the moving target in the 2D camera view from the processed optical flow images. However, in these images, the optical flow of the moving target was still mixed with some background edge values that interfered with the accuracy of extracting the moving target, as shown in the blue areas in Figure 12. This paper proposes a target extraction strategy using an improved K-means and aggregation clustering algorithm combined with a frame difference method and morphological operation to eliminate useless background optical flow. In addition, to improve the efficiency, we first set a threshold for the optical flow values to filter interesting dynamic information from the optical flow image in Figure 12. The optical flow after simple threshold filtering is shown in Figure 13 as an example. Then, the proposed 2-step extraction method was applied.

3.2.1. Improved K-Means and Agglomerative Clustering Algorithm

On the basis of improvements to the K-means and agglomerative clustering algorithms, the proposed algorithm is divided into two stages to cluster the optical flow samples, with the advantage of not needing to specify the number of categories before clustering.
At the first stage, the key difference between the classical K-means algorithm and the proposed improved method is the initialization process of the K cluster centers. Specific conditions were added to the selection of cluster centers to make the distance between cluster centers sufficiently large, thereby avoiding the adverse effects caused by completely random initialization. First, in the first stage of the improved clustering algorithm, a corresponding sample feature vector is established for each filtered optical flow vector in the optical flow field for similarity measurement. Specifically, suppose that, in the pixel coordinate system, the position coordinates of the $j$-th optical flow vector $X_j$ in the image are $(x_j, y_j)$, the corresponding optical flow amplitude is $A_j$, and the optical flow direction is $D_j$. Then, we define the four-dimensional sample feature vector of the $j$-th optical flow vector as follows:
$$X_j = [A_j, D_j, x_j, y_j]. \quad (17)$$
To simplify the calculation, normalization is required:
$$\tilde{A}_j = A_j / A_{\max}, \quad \tilde{D}_j = D_j / D_{\max}, \quad \tilde{x}_j = x_j / \sqrt{x_{\max}^2 + y_{\max}^2}, \quad \tilde{y}_j = y_j / \sqrt{x_{\max}^2 + y_{\max}^2}. \quad (18)$$
Similarly, the sample feature vectors of the other optical flow vectors are constructed. Second, the Euclidean distance was selected as the similarity measure in the first stage of the proposed improved clustering method. Specifically, we define one of the categories in the clustering process as $S_k$ with cluster-center sample features $C_k$. Then, the similarity between the $j$-th optical flow vector $X_j$ in the image and the cluster category $S_k$ is expressed as the Euclidean distance between the sample features $X_j$ and the cluster-center sample features $C_k$:
$$d_{jk} = \| X_j - C_k \| = \sqrt{\sum_{i=1}^{4} (X_{ji} - C_{ki})^2}. \quad (19)$$
A small distance denotes a high similarity between the optical flow vector and the class.
Lastly, the numbers of clusters and their cluster centers were automatically determined by using the sample features and similarity measurement function defined above, and the optical flow clustering in the first stage was completed by combining it with the classical K-means algorithm. The specific steps are as follows:
(1)
Initialize the first cluster center. Create the first optical flow class $S_1$, randomly select a filtered optical flow vector (blue parts in Figure 13), assign it to $S_1$ as the center of $S_1$, and take its sample features as the central sample features $C_1$ of the category.
(2)
Calculate the similarity. Select an unclassified optical flow vector $X$ in the optical flow field, calculate its similarity with each of the currently existing clusters, and record the Euclidean distance $d$ between its sample features and the central sample features of its most similar category.
(3)
Optical flow classification. Set a threshold $T$. If $d > T$, we create a new cluster, classify the corresponding optical flow vector $X$ into the new class, and take it as the center of the new class. If $d \le T$, we classify the corresponding optical flow vector into its most similar class, calculate the average of all the optical flow sample features in this class, and take this average as the central sample features of the class, completing the update of the center.
(4)
For each remaining unclassified optical flow vector in the optical flow field, repeat Steps 2 and 3 until the cluster centers no longer change, and output the final K cluster centers and corresponding cluster members.
The result of the first stage of the clustering algorithm in this paper is shown in Figure 14: the sample data shown in blue in Figure 13 were divided into the differently colored clusters, where the white dot in each cluster is its clustering center. The foreground optical flow was aggregated into a large area due to its similar characteristics, while the background optical flow was scattered into many categories due to its differences.
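A minimal, single-pass NumPy sketch of the first-stage clustering described in Steps 1–4 above, operating on the normalized four-dimensional features of Equations (17)–(19); the threshold value and the input array are illustrative assumptions.

import numpy as np

def first_stage_clustering(features, T=0.25):
    """Distance-thresholded center initialization, following Steps 1-4 above.

    features: N x 4 array of normalized [A, D, x, y] samples, Eqs. (17)-(18).
    T:        similarity threshold (illustrative value).
    Returns the cluster centers and a label per sample.
    """
    centers = [features[0].copy()]          # Step 1: one sample (here the first) seeds S_1
    members = [[0]]
    labels = np.zeros(len(features), dtype=int)
    for j in range(1, len(features)):
        d = np.linalg.norm(np.asarray(centers) - features[j], axis=1)   # Eq. (19)
        k = int(np.argmin(d))
        if d[k] > T:                        # Step 3: open a new cluster
            centers.append(features[j].copy())
            members.append([j])
            labels[j] = len(centers) - 1
        else:                               # Step 3: join the closest cluster
            members[k].append(j)
            labels[j] = k
            centers[k] = features[members[k]].mean(axis=0)   # update its center
    return np.asarray(centers), labels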
At the second stage of the improved algorithm, the idea of the agglomeration method in hierarchical clustering is used. After a certain number of clusters are obtained in the first stage, some clusters should be merged so that the number of clusters can finally be optimized. We propose a modified method to optimize the cluster number that generates some auxiliary data points, with reference to the agglomeration method. Unlike the classical agglomeration method, the proposed method uses these auxiliary points to improve the clustering performance.
(1)
Eight new data samples are generated according to the position of each cluster center. The eight new data points are centered on one cluster center and arranged at fixed angular intervals. Assuming that the cluster center position is $(C_{xk}, C_{yk})$, the newly produced data points $(m_i, n_i)$ satisfy
$$m_i = C_{xk} + \cos(angle[i]) \times number, \qquad n_i = C_{yk} - \sin(angle[i]) \times number, \quad (20)$$
where $angle = [0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°]$ and $number$ is the chosen distance of the auxiliary points from the center;
(2)
Then, the distances between all newly generated data points from two different clusters (for example, 8 points for Cluster 1 and another 8 points from Cluster 2) are calculated, and the shortest distance is selected as the inter-class distance.
(3)
Set a threshold and merge any two categories whose inter-class distance is less than the threshold.
(4)
Update and return the final cluster center and the cluster member.
The overall flow of the optical flow clustering algorithm in this paper is shown in Figure 15.
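A short sketch of the second-stage merging rule based on the eight auxiliary points of Equation (20) is given below; the merge threshold and point spacing are illustrative values.

import numpy as np

def auxiliary_points(center, spacing=5.0):
    """Eight points around a cluster center at 45-degree steps, Eq. (20)."""
    angles = np.deg2rad(np.arange(0, 360, 45))
    m = center[0] + np.cos(angles) * spacing
    n = center[1] - np.sin(angles) * spacing
    return np.stack([m, n], axis=1)

def interclass_distance(center_a, center_b, spacing=5.0):
    """Shortest distance between the auxiliary point sets of two clusters."""
    pa = auxiliary_points(center_a, spacing)
    pb = auxiliary_points(center_b, spacing)
    diff = pa[:, None, :] - pb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min()

def should_merge(center_a, center_b, threshold=12.0, spacing=5.0):
    """Second-stage rule: fuse two clusters whose inter-class distance is small."""
    return interclass_distance(center_a, center_b, spacing) < threshold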
The optical flow calculated by the improved clustering algorithm is shown in Figure 16. Only a few clusters remained, which helped with target extraction. Next, after setting a threshold, most of the optical flow noise could be effectively eliminated through a binarization step. The final optical flow is shown in Figure 17; thus, only true moving-target information remained.

3.2.2. Accurate Extraction Based on Frame Difference Method and Morphological Operation

The results in Figure 17 show that the target areas still had some black parts that divided the true targets into multiple pieces. Therefore, we extracted the final moving target with an extra step to improve the accuracy.
The frame difference method is commonly used for moving-target detection and segmentation. The flow chart of the two-frame difference method is shown in Figure 18. The $n$-th and $(n-1)$-th frames in a video are recorded as $\gamma_n$ and $\gamma_{n-1}$, respectively, and the gray values of the corresponding pixels of the two frames are recorded as $\gamma_n(x, y)$ and $\gamma_{n-1}(x, y)$, respectively, where $(x, y)$ represents the position of a pixel in the image. When calculating the difference image, the gray values of the corresponding pixels at each position in the two frames are subtracted. Then, we obtain the absolute image difference $D_n(x, y)$:
$$D_n(x, y) = \left| \gamma_n(x, y) - \gamma_{n-1}(x, y) \right|. \quad (21)$$
Next, we set another threshold $\epsilon$ for the gray-value difference, binarized each pixel's gray-value difference according to Equation (22), and finally obtained the binarized image $R_n(x, y)$:
$$R_n(x, y) = \begin{cases} 255, & D_n(x, y) > \epsilon \\ 0, & \text{otherwise.} \end{cases} \quad (22)$$
Thus, the difference between two successive images is shown in Figure 19.
Then, morphological processing was applied [49] to eliminate burrs and to fill the narrow discontinuities and small holes in the image, using the results in Figure 17 and Figure 19. In other words, the target areas in Figure 17 and Figure 19 were fused. The region of the target of interest after morphological processing and frame-difference fusion is shown in Figure 20. The extraction strategy proposed in this paper could separate the background and the moving-target optical flow well. Figure 21 shows the final extracted moving target.
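An OpenCV sketch combining the two-frame difference of Equations (21) and (22) with the fusion of the two masks and a morphological closing; the threshold value and kernel size are illustrative.

import cv2
import numpy as np

def frame_difference_mask(prev_gray, curr_gray, eps=25):
    """Eqs. (21)-(22): absolute difference followed by binarization."""
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, eps, 255, cv2.THRESH_BINARY)
    return mask

def fuse_and_clean(flow_mask, diff_mask, kernel_size=5):
    """Fuse the optical flow mask (cf. Figure 17) with the frame-difference mask
    (cf. Figure 19), then apply morphological closing to fill small holes."""
    fused = cv2.bitwise_or(flow_mask, diff_mask)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.morphologyEx(fused, cv2.MORPH_CLOSE, kernel)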

4. Target Localization Based on Pinhole Imaging Theory

After the moving target is detected and extracted, its location can be calculated. This paper proposes a simple mathematical model based on pinhole imaging theory. The target position relative to the camera can be calculated only from the camera parameters and the measured two-dimensional image. Some camera installation information is required, e.g., the camera's installation position, and the camera lens is assumed to be mounted horizontally.

4.1. Localization Problem Formulation

The pinhole imaging model describes the camera imaging principle shown in Figure 22a. An actual object point $M$ in the 3D space has a corresponding image point $m$ on the imaging plane after passing through the camera optical center $C$. A specific proportional relationship exists between the size of the true object and that of its image. According to the Gaussian imaging formula, we have
$$\frac{1}{f} = \frac{1}{a} + \frac{1}{b}, \quad (23)$$
where $f$ is the camera's focal length, $a$ is the object distance, and $b$ is the image distance of the object. Since the object distance is much larger than the image distance, Equation (23) can be approximated as follows:
$$\frac{1}{f} = \frac{1}{b}. \quad (24)$$
Since the camera was installed horizontally, it was reasonable to assume that the imaging plane coincided with the focal plane; the distance between the imaging plane and the optical center was then equal to the focal length $f$ of the camera.
Four coordinate systems are used in the proposed localization algorithm, i.e., the camera coordinate system, the world coordinate system, the image coordinate system, and the pixel coordinate system. The camera coordinate system uses $(X_c, Y_c, Z_c)$ to represent a point position, and its origin is the camera. The world coordinate system is denoted by $(X, Y, Z)$, $(x_i, y_i)$ represents a position in the image coordinate system, and a position in the pixel coordinate system is represented by $(u, v)$.
The difference between the image and pixel coordinate systems is shown in Figure 22b. The origin of the image coordinate system is $o_i$, which corresponds to $(u_0, v_0)$ in the pixel coordinate system. These two coordinate systems have the same orientation but different scales. Under the assumption that the side length of each pixel in the image is $d$, the following relationships between the image and pixel coordinates are obtained from Figure 22b:
$$x_i = (u - u_0) d, \quad (25)$$
$$y_i = (v - v_0) d. \quad (26)$$
Equations (25) and (26) can be rewritten as follows:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{d} & 0 & u_0 \\ 0 & \frac{1}{d} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}. \quad (27)$$
In order to simplify the calculation and analysis, we built the pinhole imaging model shown in Figure 23. A new imaging plane at focal length $f$ from the optical center of the camera was constructed; thus, 3D points were projected onto the camera coordinate system. Specifically, as shown in Figure 23, $O_c X_c Y_c Z_c$ is the camera coordinate system, $o_i x_i y_i$ is the image coordinate system, $O_c$ is the optical center, and the $Z_c$ axis coincides with the camera's optical axis. $P(X_c, Y_c, Z_c)$ is a point in the 3D space of the camera coordinate system, and $p(x_i, y_i)$ is its projected position in the image coordinate system.
According to Figure 23, combined with Equation (24), $P(X_c, Y_c, Z_c)$ and $p(x_i, y_i)$ have the following relationship:
$$\frac{X_c}{x_i} = \frac{Z_c}{f}, \quad (28)$$
$$\frac{Y_c}{y_i} = \frac{Z_c}{f}. \quad (29)$$
We have
$$Z_c \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}. \quad (30)$$
On the basis of Equations (27) and (30), we obtain
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{d} & 0 & u_0 \\ 0 & \frac{1}{d} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}. \quad (31)$$
Let $M = \begin{bmatrix} \frac{1}{d} & 0 & u_0 \\ 0 & \frac{1}{d} & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$. Then, Equation (31) is simplified into
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = M \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}. \quad (32)$$
Equation (32) shows that, for a coordinate point in 3D space, a unique pixel point can be found on the imaging plane. However, depth information cannot be obtained from a single pixel measurement; since we used monocular vision for 3D localization, additional information is required. This paper therefore proposes a strategy using the camera's installation height to calculate the final 3D position, which includes two steps: (1) computing the target azimuth angle and depth measurements, and (2) calculating the target Cartesian position using these measurements.
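A small sketch of the forward projection in Equation (32): given the pixel side length d, focal length f, and principal point (u0, v0), a camera-frame point is mapped to a pixel. All numeric values are illustrative, not the parameters of the camera used in the experiments.

import numpy as np

def project_to_pixel(P_c, f=0.008, d=4.8e-6, u0=320.0, v0=240.0):
    """Eq. (32): project a camera-frame point (Xc, Yc, Zc) to pixel (u, v)."""
    K = np.array([[1.0 / d, 0.0, u0],
                  [0.0, 1.0 / d, v0],
                  [0.0, 0.0, 1.0]])
    P = np.array([[f, 0.0, 0.0, 0.0],
                  [0.0, f, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    M = K @ P
    Xc, Yc, Zc = P_c
    uvw = M @ np.array([Xc, Yc, Zc, 1.0])
    return uvw[:2] / uvw[2]                  # divide by Zc to obtain (u, v)

print(project_to_pixel((0.5, -0.2, 3.0)))    # example camera-frame point in meters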

4.2. Angle and Distance Measurements

In the camera coordinate system, the azimuth and pitch/elevation angles are defined as $\alpha$ and $\beta$, respectively, as shown in Figure 23, and satisfy
$$\alpha = \arctan \frac{X_c}{Z_c}, \quad (33)$$
$$\beta = \arctan \frac{Y_c}{Z_c}. \quad (34)$$
Substituting (28) and (29) into (33) and (34) yields
$$\alpha = \arctan \frac{x_i}{f} = \arctan \frac{(u - u_0) d}{f}, \quad (35)$$
$$\beta = \arctan \frac{y_i}{f} = \arctan \frac{(v - v_0) d}{f}. \quad (36)$$
Therefore, according to the position of the object point in the image plane, the angular information of the target object relative to the camera in 3D space can be roughly obtained. The angular information is acquired using only the camera pixel image, the known pixel size, and the focal length. Lens distortion was ignored, or the noise caused by lens distortion was assumed to satisfy a Gaussian distribution. However, it is still impossible to calculate the target depth using only the azimuth and pitch angles; additional reference information is required, e.g., the camera installation height. As shown in Figure 24, the camera was installed at height $h$ above the ground. Moreover, the camera was placed manually, and its observation direction was known and defined by the tilt angle $\theta$. The world coordinate system was established simultaneously, and the origins of the two coordinate systems coincided. As an example, the angular diagram for a target moving on the ground is shown in Figure 24.
The relationship between the world coordinate system and the camera coordinate system can be expressed in the following form:
$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & \sin\theta \\ 0 & -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix}. \quad (37)$$
From Equations (28) and (29), we can obtain the 2D image position of the target centroid, $(x_i, y_i)$. Since the camera focal length $f$ and the target centroid's $Y$-axis position in the camera coordinate system are both known, the target centroid position $(X_c, Y_c, Z_c)$ can be calculated. Then, the distance between the target and the camera can be obtained:
$$S = \sqrt{X_c^2 + Y_c^2 + Z_c^2}. \quad (38)$$
Since the camera tilt angle $\theta$ is known, and the target centroid pitch angle can be obtained from Equation (36), the geometric relationship in Figure 24 gives
$$O_c A = \frac{h}{\sin(\theta + \beta)}, \quad (39)$$
$$AB = O_c A \sin|\beta|. \quad (40)$$
In particular, when $\beta = 0$, i.e., $y_i = 0$, then $Y_c = 0$ and $Z_c = \frac{h}{\sin\theta}$ can be easily obtained from Figure 24, and $X_c = \frac{Z_c x_i}{f}$ is acquired from Equation (30). In addition, when the height of the target cannot be ignored (i.e., the target is not a point mass on the ground), $O_c A = \frac{h - l}{\sin(\theta + \beta)}$ holds. Accordingly, when $\beta = 0$, $Z_c = \frac{h - l}{\sin\theta}$, where $l$ is the height of the target centroid above the ground. The target height can be estimated using a known or pre-given static reference object.
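A sketch of the full back-projection described in this subsection, assuming the ground-plane geometry of Figure 24: the pixel measurement gives the angles of Equations (35) and (36), the installation height and tilt give the slant range of Equation (39), and the camera-frame position follows. The decomposition of O_cA into the Y_c and Z_c components is our reading of the figure, and all numeric parameters are illustrative.

import numpy as np

def locate_ground_target(u, v, f=0.008, d=4.8e-6, u0=320.0, v0=240.0,
                         h=1.5, theta=np.deg2rad(20.0), l=0.0):
    """Estimate the camera-frame position of a target on the ground.

    u, v  : pixel coordinates of the target centroid
    f, d  : focal length and pixel side length (meters)
    h     : camera installation height; theta: downward tilt angle
    l     : height of the target centroid above the ground (0 for a ground point)
    """
    alpha = np.arctan((u - u0) * d / f)          # Eq. (35), azimuth
    beta = np.arctan((v - v0) * d / f)           # Eq. (36), pitch

    OcA = (h - l) / np.sin(theta + beta)         # Eq. (39), slant range to the target
    Zc = OcA * np.cos(beta)                      # our reading of Figure 24
    Yc = OcA * np.sin(beta)
    Xc = Zc * (u - u0) * d / f                   # from Eq. (28)

    S = np.sqrt(Xc**2 + Yc**2 + Zc**2)           # Eq. (38), camera-target distance
    return (Xc, Yc, Zc), (alpha, beta), S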

5. Moving-Target Tracking Using Visual Image and Cubature Kalman Filter

The discrete state-transition and observation equations of the nonlinear system in this paper take the following form:
$$X_k = f(X_{k-1}) + \omega_k, \qquad Z_k = h(X_k) + \zeta_k, \quad (41)$$
where $X_k$ is the state vector at time instant $k$, $Z_k$ is the measurement vector at $k$, $f(\cdot)$ is the dynamic model function, and $h(\cdot)$ is the measurement function. Here, we used both angle and distance measurements. $\omega_k$ and $\zeta_k$ were assumed to be zero-mean additive Gaussian white noise representing the process and measurement noise, with covariance matrices $Q_k$ and $R_k$, respectively.
Since the CKF has many advantages in nonlinear estimation [44], it has been widely applied as a state-of-the-art method. The structure of the CKF algorithm is divided into two parts, i.e., prediction and update with measurement steps.
  • Prediction update:
(i) Decompose the estimation error covariance matrix:
$$P_{k|k} = S_{k|k} (S_{k|k})^T. \quad (42)$$
(ii) Calculate the cubature points:
$$X_{i,k|k} = S_{k|k}\, \xi_i + \hat{X}_{k|k}, \quad i = 1, 2, \ldots, m. \quad (43)$$
(iii) Propagate the cubature points through the state-transition function:
$$X_{i,k+1|k}^{*} = f_k(X_{i,k|k}), \quad i = 1, 2, \ldots, m, \quad (44)$$
where $X_{i,k|k}$ and $X_{i,k+1|k}^{*}$ are cubature points, and $m$ is the number of cubature points. When the third-order spherical-radial criterion is used, the number of cubature points is twice the dimension $n$ of the state vector of the nonlinear system. $\xi_i$ is the cubature point set, $\xi_i = \sqrt{\frac{m}{2}}\,[1]_i$, $i = 1, 2, \ldots, m$, where $[1]$ is the following point set of the $n$-dimensional space:
$$[1] = \left\{ \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ -1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ -1 \end{bmatrix} \right\}.$$
(iv) Calculate the state prediction:
$$\hat{X}_{k+1|k} = \frac{1}{m} \sum_{i=1}^{m} X_{i,k+1|k}^{*}. \quad (45)$$
(v) Calculate the prediction covariance matrix:
$$P_{k+1|k} = \frac{1}{2n} \sum_{i=1}^{2n} X_{i,k+1|k}^{*} (X_{i,k+1|k}^{*})^T - \hat{X}_{k+1|k} \hat{X}_{k+1|k}^T + Q_k. \quad (46)$$
  • Update with measurement:
(i) Decompose the prediction covariance matrix:
$$P_{k+1|k} = S_{k+1|k} (S_{k+1|k})^T. \quad (47)$$
(ii) Calculate the updated cubature points:
$$X_{i,k+1|k} = S_{k+1|k}\, \xi_i + \hat{X}_{k+1|k}, \quad i = 1, 2, \ldots, m. \quad (48)$$
(iii) Propagate the cubature points through the measurement function:
$$Z_{i,k+1|k} = h_{k+1}(X_{i,k+1|k}), \quad i = 1, 2, \ldots, m. \quad (49)$$
(iv) Calculate the predicted measurement:
$$\hat{Z}_{k+1|k} = \frac{1}{m} \sum_{i=1}^{m} Z_{i,k+1|k}. \quad (50)$$
(v) Calculate the innovation:
$$e_{k+1} = Z_{k+1} - \hat{Z}_{k+1|k}, \quad (51)$$
where $Z_{k+1}$ is the measured value at $k+1$.
(vi) Calculate the innovation covariance matrix:
$$P_{ZZ,k+1|k} = \frac{1}{2n} \sum_{i=1}^{2n} Z_{i,k+1|k} Z_{i,k+1|k}^T - \hat{Z}_{k+1|k} \hat{Z}_{k+1|k}^T + R_k. \quad (52)$$
(vii) Calculate the cross-covariance matrix:
$$P_{XZ,k+1|k} = \frac{1}{2n} \sum_{i=1}^{2n} X_{i,k+1|k} Z_{i,k+1|k}^T - \hat{X}_{k+1|k} \hat{Z}_{k+1|k}^T. \quad (53)$$
(viii) Calculate the cubature Kalman filter gain:
$$K_{k+1} = P_{XZ,k+1|k} (P_{ZZ,k+1|k})^{-1}. \quad (54)$$
(ix) Calculate the estimated state at $k+1$:
$$\hat{X}_{k+1|k+1} = \hat{X}_{k+1|k} + K_{k+1} (Z_{k+1} - \hat{Z}_{k+1|k}). \quad (55)$$
(x) Calculate the estimation error covariance matrix:
$$P_{k+1|k+1} = P_{k+1|k} - K_{k+1} P_{ZZ,k+1|k} K_{k+1}^T. \quad (56)$$
On the basis of the pinhole imaging and positioning model described in Section 4, the CKF is used to estimate the 3D target state in combination with the assumed dynamic model.
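For readers who want to reproduce the filter, the following NumPy sketch implements the prediction and update steps (42)–(56) for generic f and h functions; it is a minimal implementation under the standard third-order spherical-radial cubature rule, not the authors' exact code.

import numpy as np

def ckf_step(x_hat, P, z, f, h, Q, R):
    """One cubature Kalman filter cycle, following steps (42)-(56).

    x_hat, P : prior state estimate and covariance at time k
    z        : measurement at time k+1
    f, h     : state-transition and measurement functions
    Q, R     : process and measurement noise covariances
    """
    n = x_hat.size
    m = 2 * n
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])    # cubature points, sqrt(m/2) = sqrt(n)

    # --- Prediction update ---
    S = np.linalg.cholesky(P)                               # (42)
    X = S @ xi + x_hat[:, None]                             # (43)
    X_prop = np.column_stack([f(X[:, i]) for i in range(m)])        # (44)
    x_pred = X_prop.mean(axis=1)                            # (45)
    P_pred = X_prop @ X_prop.T / m - np.outer(x_pred, x_pred) + Q   # (46)

    # --- Update with measurement ---
    S_pred = np.linalg.cholesky(P_pred)                     # (47)
    X_upd = S_pred @ xi + x_pred[:, None]                   # (48)
    Z_prop = np.column_stack([h(X_upd[:, i]) for i in range(m)])    # (49)
    z_pred = Z_prop.mean(axis=1)                            # (50)
    P_zz = Z_prop @ Z_prop.T / m - np.outer(z_pred, z_pred) + R     # (52)
    P_xz = X_upd @ Z_prop.T / m - np.outer(x_pred, z_pred)          # (53)
    K = P_xz @ np.linalg.inv(P_zz)                          # (54)
    x_new = x_pred + K @ (z - z_pred)                       # (51), (55)
    P_new = P_pred - K @ P_zz @ K.T                         # (56)
    return x_new, P_new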

6. Experimental Examples

In this paper, we used an ordinary monocular industrial camera to acquire some videos, and the detailed parameters of the camera are shown in Table 2. The methods proposed in this paper were verified step by step. More specifically, the improved optical flow method introduced in Section 3 is first verified, and then the proposed geometrical algorithm using the pinhole imaging method is demonstrated. Lastly, combined with the CKF algorithm, we verify the effectiveness of the proposed strategy in Section 3, Section 4 and Section 5 via different examples with some comparisons.

6.1. Target Detection and Extraction (Using the Proposed Method in Section 3)

Several moving target videos with different target objects and motion states were randomly taken in indoor and outdoor environments. The optical flow calculation and moving-target extraction network using the proposed algorithm could effectively suppress the interference of background noise and accurately extract the region of interest. The experimental results are shown in Figure 25. The first column shows the pre-processed images, the second column indicates the optical flow visualization images, the third column shows the moving target segmentation results, and the fourth column displays the final extracted moving targets.

6.2. Moving Target Localization (Using the Proposed Methods in Section 4 and Section 5)

In order to verify the accuracy of the positioning method, several identified target images were captured in this experiment. The data were then processed, and the results were compared with the true values measured with an advanced RGB camera. The first step was to set up some experimental scenarios and randomly place the experimental target at multiple positions; the camera's height, shooting angle, and target position were varied. In the second step, we recorded the relevant data needed for the model calculation, such as the camera's height and shooting angle, collected the target images, extracted the target information from the 2D images by using the algorithm proposed in Section 3, and then calculated the targets' locations. Six pictures captured from the previous videos in Figure 25 were used. In the last step, the angle and distance measurements were used to calculate the 3D target positions in the different examples. The experimental results are shown in Table 3 [42]. Table 3 shows that the target locations computed with the proposed geometrical method were very close to the true values, but some errors still existed; these were caused by the extraction process and camera lens distortion noise, which commonly satisfy a Gaussian distribution.
The modified CKF method was then applied to process the measurements to improve the target tracking performance. The CKF utilized our contribution regarding the modified measurement model proposed in (39) and (40). In the CKF algorithm, we assumed that the target moved with constant velocity, and in each image, multiple angular measurements could be produced to keep the CKF running. In the CKF simulation, the sampling time interval was $T$. With the proposed pinhole imaging localization method, the state vector contained the 3D position and velocity of the moving target, and the measurement was the 2D target position in the image coordinate system. Therefore, the dynamic system model is
$$X_k = \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ V_X \\ V_Y \\ V_Z \end{bmatrix}_k = \begin{bmatrix} 1 & 0 & 0 & T & 0 & 0 \\ 0 & 1 & 0 & 0 & T & 0 \\ 0 & 0 & 1 & 0 & 0 & T \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ V_X \\ V_Y \\ V_Z \end{bmatrix}_{k-1} + G \omega_{k-1}, \quad (57)$$
and the measurement model is
$$Z_k = \begin{bmatrix} u \\ v \end{bmatrix}_k = h(X_k) + \zeta_k = \begin{bmatrix} \frac{f X_c}{d Z_c} + u_0 \\ \frac{f Y_c}{d Z_c} + v_0 \end{bmatrix} + \zeta_k, \quad (58)$$
where $\omega_{k-1}$ and $\zeta_k$ are the process and measurement noise, respectively, and $G$ is a noise driving (transformation) matrix.
To clearly show the advantages of the proposed CKF, a moving target with nonlinear process noise (maneuvering) was tracked. In the dynamic model, we have
$$G = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & \sin[\vartheta_{k-1} T] & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}_{6 \times 6}, \quad \omega_{k-1} \sim N(0, Q_{k-1}), \quad Q_{k-1} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & \tau & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}_{6 \times 6}.$$
$\vartheta$ and $\tau$ determine the amplitude and frequency, respectively, of the sinusoidal motion of the moving target. The EKF method [50,51] was applied as a comparison, and both the EKF and the CKF were run with 1000 Monte Carlo repetitions.
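A sketch of the constant-velocity transition and pixel measurement functions of Equations (57) and (58), in a form that plugs directly into the ckf_step sketch above; the sampling interval and camera parameters are illustrative.

import numpy as np

T = 0.2                                           # sampling interval (illustrative)
f_len, d, u0, v0 = 0.008, 4.8e-6, 320.0, 240.0    # illustrative camera parameters

# Eq. (57): constant-velocity state transition for [Xc, Yc, Zc, Vx, Vy, Vz].
F = np.eye(6)
F[0, 3] = F[1, 4] = F[2, 5] = T

def f(x):
    return F @ x

# Eq. (58): pixel measurement of the 3D camera-frame position.
def h(x):
    Xc, Yc, Zc = x[0], x[1], x[2]
    return np.array([f_len * Xc / (d * Zc) + u0,
                     f_len * Yc / (d * Zc) + v0])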
In this part, Examples 1 and 2 are provided with different target initial positions (the EKF and CKF had the same initial state). In Example 1, the true target initial state was $[300, 400, 800, 2, 4, 6]^T$. The initial state was $X_0 = [320, 420, 820, 2.1, 3.8, 6.3]^T$, with $R_k = \mathrm{diag}[1, 1]$ and $P_0 = \mathrm{diag}[100, 100, 100, 100, 100, 100]$ for both Kalman filters. The sampling time was set to 0.15 s, and the total running time was 300 s. To ensure the sinusoidal motion of the moving target, we set the amplitude $\vartheta = 600$ m in the noise driving matrix and the frequency $\tau = 0.12$ s in the process noise variance.
In Example 2, the true target initial state was $[800, 300, 400, 1, 1, 2]^T$. The initial state was $[100, 100, 630, 4.2, 0.8, 2.2]^T$, with $R_k = \mathrm{diag}[1, 1]$ and $P_0 = \mathrm{diag}[1000, 1000, 1000, 1000, 1000, 1000]$ for both Kalman filters. We set the amplitude $\vartheta = 100$ m in the noise driving matrix and the frequency $\tau = 0.1$ s in the process noise variance. The sampling time was 0.2 s, and the total running time was 200 s. The estimated results of the different methods were recorded, and the true and estimated target trajectories using the EKF and CKF were drawn. The RMSEs between the estimated results and the true target states for the different methods are shown in Figure 26, Figure 27, Figure 28 and Figure 29.
The first example shows that the proposed CKF achieved satisfactory convergence when tracking a sinusoidally moving target. The CKF also had better estimation accuracy than the EKF in both position and velocity estimation; the EKF diverged in some of the Monte Carlo runs, resulting in target loss. In the second example, divergence did not happen with either method, since the true target initial position was more favorable for the EKF. However, the proposed CKF still produced more accurate estimates than the EKF. The effectiveness and superiority of the proposed CKF in moving-target tracking were thus verified.

7. Conclusions

This paper addressed moving-target detection and tracking problems in a 3D space with a proposed visual target tracking system only using a 2D camera. First, an improved optical flow method combined with a clustering algorithm was developed to accurately detect and extract a moving target from a noisy background. Second, a geometrical pinhole imaging algorithm was proposed to roughly and quickly estimate the target position. Third, the CKF algorithm was modified by directly using the camera pixel measurement for moving-target tracking in a 3D space. Simulation and experimental results demonstrated the effectiveness of the proposed systematic method. The work in this paper has great practical value, as it provides a solution for mobile-target detection and estimation using low-cost equipment.
In the future, we will further improve the accuracy of moving-target detection and estimation by considering the distortion of the camera lens. In addition, a pan–tilt–zoom servo system will be combined with the proposed system to further improve moving-target tracking performance.

Author Contributions

Methodology, S.W. and S.X.; Software, Z.M.; Data curation, Z.M.; Writing—original draft, S.W.; Writing—review & editing, S.X.; Supervision, S.X., D.W. and W.L.; Project administration, S.X.; Funding acquisition, D.W. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62273327 and 32001424), the Guangdong Basic and Applied Basic Research Foundation (2020B1515120054), the Key Research and Development Plan of Shandong Province (2021CXGC011304), the Shenzhen Science and Technology Program (KCXFZ20211020165003005, JCYJ20210324102401005), and the Shenzhen Basic Key Research Project (JCYJ20200109115414354).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, S.; Dogancay, K. Optimal sensor placement for 3-D angle-of-arrival target localization. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 1196–1211. [Google Scholar] [CrossRef]
  2. Wang, D.; Cao, W.; Zhang, F.; Li, Z.; Xu, S.; Wu, X. A review of deep learning in multiscale agricultural sensing. Remote Sens. 2017, 14, 559. [Google Scholar] [CrossRef]
  3. Xu, T.; Hao, Z.; Huang, C.; Yu, J.; Zhang, L.; Wu, X. Multi-modal Locomotion Control of Needle-like Microrobots Assembled by Ferromagnetic Nanoparticles. IEEE/ASME Trans. Mechatron. 2022, 27, 4327–4338. [Google Scholar] [CrossRef]
  4. Wu, Q.; Guo, H.; Wu, X.; He, T. Real-Time running detection from a patrol robot. J. Integr. Technol. 2017, 6, 50–58. [Google Scholar]
  5. Xu, T.; Guan, Y.; Liu, J.; Wu, X. Image-Based Visual Servoing of Helical Microswimmers for Planar Path Following. IEEE Trans. Autom. Sci. Eng. 2020, 17, 325–333. [Google Scholar] [CrossRef]
  6. Park, J.; Cho, J.; Lee, S.; Bak, S.; Kim, Y. An Automotive LiDAR Performance Test Method in Dynamic Driving Conditions. Sensors 2023, 23, 3892. [Google Scholar] [CrossRef]
  7. Li, M.; Fan, J.; Zhang, Y.; Lu, Y.; Su, G. Research of moving target detection technology in intelligent video surveillance system. J. Theor. Appl. Inf. Technol. 2013, 49, 613–617. [Google Scholar]
  8. Sun, W.; Du, H.; Ma, G.; Shi, S.; Wu, Y. Moving vehicle video detection combining ViBe and inter-frame difference. Int. J. Embed. Syst. 2020, 12, 371–379. [Google Scholar] [CrossRef]
  9. Xu, T.; Huang, C.; Lai, Z.; Wu, X. Independent Control Strategy of Multiple Magnetic Flexible Millirobots for Position Control and Path Following. IEEE Trans. Robot. 2022, 38, 2875–2887. [Google Scholar] [CrossRef]
  10. Xu, T.; Hwang, G.; Andreff, N.; Régnier, S. Planar Path Following of 3-D Steering Scaled-Up Helical Microswimmers. IEEE Trans. Robot. 2015, 31, 117–127. [Google Scholar] [CrossRef]
  11. Xu, S.; Liu, J.; Yang, C.; Wu, X.; Xu, T. A Learning-Based Stable Servo Control Strategy Using Broad Learning System Applied for Microrobotic Control. IEEE Trans. Cybern. 2021, 38, 2875–2887. [Google Scholar] [CrossRef]
  12. Jain, R.; Nagel, H. On the Analysis of Accumulative Difference Pictures from Image Sequences of Real World Scenes. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 206–214. [Google Scholar] [CrossRef]
  13. Li, C. Dangerous Posture Monitoring for Undersea Diver Based on Frame Difference Method. J. Coast. Res. 2020, 103 (Suppl. 1), 939–942. [Google Scholar] [CrossRef]
  14. Li, T.; Jiang, B.; Wu, D.; Yin, X.; Song, H. Tracking Multiple Target Cows’ Ruminant Mouth Areas Using Optical Flow and Inter-Frame Difference Methods. IEEE Access 2019, 7, 185520–185531. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Wang, X.; Qu, B. Three-Frame Difference Algorithm Research Based on Mathematical Morphology. Procedia Eng. 2012, 29, 2705–2709. [Google Scholar] [CrossRef]
  16. Qu, J.; Xin, Y. Combined continuous frame difference with background difference method for moving object detection. Guangzi Xuebao/Acta Photonica Sin. 2014, 43, 213–220. [Google Scholar]
  17. Wahyono; Filonenko, A.; Jo, K. Unattended Object Identification for Intelligent Surveillance Systems Using Sequence of Dual Background Difference. IEEE Trans. Ind. Inform. 2016, 12, 2247–2255. [Google Scholar] [CrossRef]
  18. Montero, V.; Jung, W.; Jeong, Y. Fast background subtraction with adaptive block learning using expectation value suitable for real-time moving object detection. J. Real-Time Image Process. 2021, 18, 967–981. [Google Scholar] [CrossRef]
  19. Liu, Y.; Jiang, F.; Wang, Y.; Ouyang, L.; Zhang, B. Research on Design of Intelligent Background Differential Model for Training Target Monitoring. Complexity 2021, 2021, 5513788. [Google Scholar] [CrossRef]
  20. Mitiche, A.; Mansouri, A. On Convergence of the Horn and Schunck Optical-Flow Estimation Method. IEEE Trans. Image Process. 2004, 13, 848–852. [Google Scholar] [CrossRef]
  21. Sharmin, N.; Brad, R. Optimal Filter Estimation for Lucas-Kanade Optical Flow. Sensors 2012, 12, 12694–12709. [Google Scholar] [CrossRef]
  22. Bouguet, J. Pyramidal Implementation of the Lucas Kanade Feature Tracker. Opencv Documents. 2000. Available online: https://web.stanford.edu/class/cs231m/references/pyr-lucas-kanade-feature-tracker-bouget.pdf (accessed on 15 May 2023).
  23. Mahdianpari, M.; Salehi, B.; Rezaee, M.; Mohammadimanesh, F.; Zhang, Y. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Remote Sens. 2018, 10, 1119. [Google Scholar] [CrossRef]
  24. Ranjan, A.; Black, M. Optical Flow Estimation Using a Spatial Pyramid Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2720–2729. [Google Scholar]
  25. Sun, D.; Yang, X.; Liu, M.; Kautz, J. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8934–8943. [Google Scholar]
  26. Hui, T.; Tang, X.; Loy, C. LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8981–8989. [Google Scholar]
  27. Zhao, S.; Sheng, Y.; Dong, Y.; Chang, I.; Xu, Y. MaskFlowNet: Asymmetric feature matching with learnable occlusion mask. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6277–6286. [Google Scholar]
  28. Gao, K.L.; Liu, B.; Yu, X.C.; Qin, J.C.; Zhang, P.Q.; Tan, X. Deep relation network for hyperspectral image few-shot classification. Remote Sens. 2020, 12, 923. [Google Scholar] [CrossRef]
  29. Sa, I.; Popović, M.; Khanna, R.; Chen, Z.; Lottes, P.; Liebisch, F.; Nieto, J.; Stachniss, C.; Walter, A.; Siegwart, R. WeedMap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming. Remote Sens. 2018, 10, 1423. [Google Scholar] [CrossRef]
  30. Liu, J.; Liu, W.; Gao, L.; Li, L. Detection and localization of underwater targets based on monocular vision. In Proceedings of the International Conference on Advanced Robotics and Mechatronics (ICARM), Hefei and Tai’an, China, 29–31 August 2017; pp. 100–105. [Google Scholar]
  31. Bi, S.; Gu, Y.; Zhang, Z.; Liu, H.; Gong, M. Multi-camera stereo vision based on weights. In Proceedings of the IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Dubrovnik, Croatia, 25–28 May 2020; pp. 1–6. [Google Scholar]
  32. Ma, W.; Li, W.; Cao, P. Binocular Vision Object Positioning Method for Robots Based on Coarse-fine Stereo Matching. Int. J. Autom. Comput. 2020, 17, 86–95. [Google Scholar] [CrossRef]
  33. Mao, J.; Huang, W.; Sheng, W. Target distance measurement method using monocular vision. IET Image Process. 2020, 14, 3181–3187. [Google Scholar]
  34. Cao, Y.; Wang, X.; Yan, Z. Target azimuth estimation for automatic tracking in range-gated imaging. Int. Soc. Opt. Eng. 2012, 8558, 1429–1435. [Google Scholar]
  35. Park, J.; Kim, T.; Kim, J. Model-referenced pose estimation using monocular vision for autonomous intervention tasks. Auton. Robot. 2019, 44, 205–216. [Google Scholar] [CrossRef]
  36. Qiu, Z.; Hu, S.; Li, M. Single view based nonlinear vision pose estimation from coplanar points. Optik 2020, 208, 163639. [Google Scholar] [CrossRef]
  37. Penelle, B.; Debeir, O. Target distance measurement method using monocular vision. In Proceedings of the Virtual Reality International Conference, Laval, France, 9–11 April 2014; pp. 1–7. [Google Scholar]
  38. Vokorokos, L.; Mihalov, J.; Lescisin, L. Possibilities of depth cameras and ultra wide band sensor. In Proceedings of the IEEE International Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 21–23 January 2016; pp. 57–61. [Google Scholar]
  39. Xu, S.; Dogancay, K.; Hmam, H. 3D AOA Target Tracking Using Distributed Sensors with Multi-hop Information Sharing. Signal Process. 2018, 144, 192–200. [Google Scholar] [CrossRef]
  40. Nguyen, V.-H.; Pyun, J.-Y. Location Detection and Tracking of Moving Targets by a 2D IR-UWB Radar System. Sensors 2015, 15, 6740–6762. [Google Scholar] [CrossRef]
  41. Xu, S.; Dogancay, K.; Hmam, H. Distributed Pseudolinear Estimation and UAV Path Optimization for 3D AOA Target Tracking. Signal Process. 2017, 133, 64–78. [Google Scholar] [CrossRef]
  42. Wang, S.; Guo, Q.; Xu, S.; Su, D. A moving target detection and localization strategy based on optical flow and pin-hole imaging methods using monocular vision. In Proceedings of the IEEE International Conference on Real-Time Computing and Robotics (RCAR), Xining, China, 15–19 July 2021; pp. 147–152. [Google Scholar]
  43. Xu, S.; Wu, L.; Dogancay, K.; Alaee-Kerahroodi, M. A Hybrid Approach to Optimal TOA-Sensor Placement with Fixed Shared Sensors for Simultaneous Multi-Target Localization. IEEE Trans. Signal Process. 2022, 70, 1197–1212. [Google Scholar] [CrossRef]
  44. Arasaratnam, I.; Haykin, S. Cubature Kalman Filters. IEEE Trans. Autom. Control 2009, 54, 1254–1269. [Google Scholar] [CrossRef]
  45. Kumar, T.; Verma, K. A Theory Based on Conversion of RGB Image to Gray image. Int. J. Comput. Appl. 2010, 7, 7–10. [Google Scholar] [CrossRef]
  46. Sihotang, J. Implementation of Gray Level Transformation Method for Sharping 2D Images. INFOKUM 2019, 8, 16–19. [Google Scholar]
  47. Zuiderveld, K. Contrast limited adaptive histogram equalization. In Graphics Gems; Academic Press: Cambridge, MA, USA, 1994; pp. 474–485. [Google Scholar]
  48. Bahadarkhan, K.; Khaliq, A.; Shahid, M. A Morphological Hessian Based Approach for Retinal Blood Vessels Segmentation and Denoising Using Region Based Otsu Thresholding. PLoS ONE 2016, 11, e0162581. [Google Scholar] [CrossRef] [PubMed]
  49. Comer, M.L.; Delp, E.J., III. Morphological operations for color image processing. J. Electron. Imaging 1999, 8, 279–289. [Google Scholar] [CrossRef]
  50. Xu, S.; Ou, Y.; Wu, X. Optimal sensor placement for 3-D time-of-arrival target localization. IEEE Trans. Signal Process. 2019, 67, 5018–5031. [Google Scholar] [CrossRef]
  51. Xu, S. Optimal sensor placement for target localization using hybrid RSS, AOA and TOA measurements. IEEE Commun. Lett. 2020, 24, 1966–1970. [Google Scholar] [CrossRef]
Figure 1. Pyramid Lucas–Kanade optical flow method.
Figure 2. Schematic diagram of monocular visual–spatial target localization.
Figure 3. Two-dimensional front view of monocular vision spatial target localization (from the camera side).
Figure 4. Moving-target detection and extraction process.
Figure 5. Images acquired after grayscale processing (an indoor basketball and an outdoor person).
Figure 6. The process of the wavelet transform threshold for noise elimination.
Figure 7. Images processed with the wavelet transform threshold method (original images are shown in Figure 5).
Figure 8. Final images after grayscale processing, noise elimination, and contrast enhancement.
Figure 9. Problems of image doubling, ambiguity, and invalid information.
Figure 10. AsymOFMM's overall structure in the improved PWC-Net.
Figure 11. Results after PWC-Net.
Figure 12. Calculation results of the improved PWC-Net.
Figure 13. Threshold filtering results from Figure 12.
Figure 14. Calculation results of the first stage of the clustering algorithm.
Figure 15. Calculation flow of the improved clustering algorithm.
Figure 16. Calculation results of the second stage of the clustering algorithm.
Figure 17. Final calculation result of the improved clustering algorithm.
Figure 18. Workflow of the two-frame difference method.
Figure 19. Difference-value image produced by the inter-frame difference method.
Figure 20. Fusion results of morphological processing and the frame difference method.
Figure 21. Moving-target extraction results.
Figure 22. (a) Principle of camera imaging; (b) image coordinate system and pixel coordinate system.
Figure 23. Visual coordinate system.
Figure 24. Projection diagram of distance-information measurement.
Figure 25. Experimental results of moving-target detection and extraction. (a) A walking person outdoors; (b) a running person at closer range; (c) a small static ball indoors; (d) a moving basketball indoors; (e) a walking person indoors.
Figure 26. Comparison of EKF and CKF for tracking a nonlinear mobile target (Example 1); unit: meters.
Figure 27. Comparison of the position and velocity RMSEs between the EKF and CKF in Example 1.
Figure 28. Comparison of EKF and CKF for tracking a nonlinear mobile target (Example 2); unit: meters.
Figure 29. Comparison of the position and velocity RMSEs between the EKF and CKF in Example 2.
Table 1. Performance comparison of the improved PWC-Net on various datasets.

Network       Forecast time (s)   Sintel Clean AEPE (train/test)   Sintel Final AEPE (train/test)   KITTI 2012 AEPE (train/test)   KITTI 2015 Fl-all (train/test)
PWC-Net       0.03                2.55 / 3.86                      3.93 / 5.13                      4.14 / 1.73                    3.67% / 9.60%
Improvement   0.03                2.33 / 2.77                      3.72 / 4.38                      3.21 / 1.12                    3.58% / 6.81%
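
For context on the metrics in Table 1, the short Python sketch below shows how the average endpoint error (AEPE) reported for the Sintel and KITTI columns is conventionally computed. This is our own illustration with hypothetical array names, not code from the paper's pipeline.

import numpy as np

def average_endpoint_error(flow_pred, flow_gt):
    # Per-pixel Euclidean distance between predicted and ground-truth
    # flow vectors (H x W x 2 arrays), averaged over the image.
    diff = flow_pred - flow_gt
    epe = np.sqrt(np.sum(diff ** 2, axis=-1))
    return float(np.mean(epe))

# Hypothetical example: a random "ground-truth" flow and a perturbed prediction.
rng = np.random.default_rng(0)
flow_gt = rng.normal(size=(436, 1024, 2)).astype(np.float32)
flow_pred = flow_gt + 0.1 * rng.normal(size=flow_gt.shape).astype(np.float32)
print(f"AEPE = {average_endpoint_error(flow_pred, flow_gt):.3f}")

The Fl-all columns, by contrast, report the percentage of outlier pixels, where a pixel counts as an outlier when its endpoint error exceeds 3 px and 5% of the ground-truth flow magnitude.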
Table 2. Experimental camera parameters.

Sensor specification:    Advanced CMOS photosensitive chip, 1/2.7 inch
Pixel size:              3 μm × 3 μm
Default speed:           30 frames/s
Camera lens:             Infrared, 60 degrees, no distortion
Signal-to-noise ratio:   39 dB
Hardware:                Industrial grade, 2 megapixels
Power:                   1 W
Working voltage:         5 V
Output resolution:       1920 × 1080
Interface:               USB 2.0, supports the UVC communication protocol
Focal length:            6 mm
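
As a worked example of how the intrinsics in Table 2 feed the pinhole-imaging geometry, the sketch below converts the listed focal length and pixel pitch into a focal length in pixels and an approximate field of view. It is our own back-of-the-envelope calculation under the standard pinhole model, not code from the paper.

import math

# Camera parameters taken from Table 2.
focal_length_mm = 6.0               # lens focal length
pixel_size_um = 3.0                 # square pixel pitch
width_px, height_px = 1920, 1080    # output resolution

# Focal length expressed in pixels: f divided by the pixel pitch.
fx = focal_length_mm * 1e-3 / (pixel_size_um * 1e-6)        # = 2000 px

# Sensor dimensions implied by resolution and pixel pitch.
sensor_w_mm = width_px * pixel_size_um * 1e-3               # ~5.76 mm
sensor_h_mm = height_px * pixel_size_um * 1e-3              # ~3.24 mm
sensor_d_mm = math.hypot(sensor_w_mm, sensor_h_mm)

# Horizontal and diagonal fields of view of the pinhole model.
hfov = 2 * math.degrees(math.atan(sensor_w_mm / (2 * focal_length_mm)))
dfov = 2 * math.degrees(math.atan(sensor_d_mm / (2 * focal_length_mm)))
print(f"fx = {fx:.0f} px, HFOV = {hfov:.1f} deg, DFOV = {dfov:.1f} deg")

This gives fx ≈ 2000 px and a horizontal field of view of roughly 51°, which is consistent with the nominal 60° lens angle in Table 2 if that figure refers to the diagonal field of view (about 58° for this sensor).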
Table 3. Experimental results using the proposed method.

                    Picture 1    Picture 2    Picture 3    Picture 4    Picture 5    Picture 6
Camera height       87.5 cm      87.5 cm      87.5 cm      87.5 cm      75.9 cm      75.9 cm
Shooting angle      45°          45°          45°          45°          30°          15°
Centroid position   (479, 259)   (1474, 275)  (1476, 843)  (439, 829)   (340, 181)   (293, 967)
Azimuth             −13.52°      14.41°       14.47°       −14.60°      −17.22°      −18.44°
Reference azimuth   −15.16°      20.5°        21.72°       −15.65°      −18.94°      −20.21°
Pitch               −8.00°       −7.55°       8.61°        8.22°        −10.18°      12.05°
Reference pitch     −9.85°       −9.54°       10.92°       10.79°       −11.69°      13.49°
Distance            149.5 cm     148.5 cm     112.1 cm     112.8 cm     234.1 cm     175.5 cm
Reference distance  157.4 cm     160.7 cm     113.4 cm     109.7 cm     241.3 cm     182.6 cm
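
To make the accuracy in Table 3 easier to read at a glance, the following sketch computes mean absolute errors between the estimated and reference values transcribed from the table. It is our own post-processing of the published numbers, not part of the proposed method.

import numpy as np

# Values transcribed from Table 3 (Pictures 1-6).
azimuth      = np.array([-13.52, 14.41, 14.47, -14.60, -17.22, -18.44])
azimuth_ref  = np.array([-15.16, 20.50, 21.72, -15.65, -18.94, -20.21])
pitch        = np.array([ -8.00,  -7.55,  8.61,   8.22, -10.18,  12.05])
pitch_ref    = np.array([ -9.85,  -9.54, 10.92,  10.79, -11.69,  13.49])
distance     = np.array([149.5, 148.5, 112.1, 112.8, 234.1, 175.5])   # cm
distance_ref = np.array([157.4, 160.7, 113.4, 109.7, 241.3, 182.6])   # cm

def mae(est, ref):
    # Mean absolute error between estimates and reference measurements.
    return float(np.mean(np.abs(est - ref)))

print(f"Azimuth MAE:  {mae(azimuth, azimuth_ref):.2f} deg")
print(f"Pitch MAE:    {mae(pitch, pitch_ref):.2f} deg")
print(f"Distance MAE: {mae(distance, distance_ref):.2f} cm")

On these six pictures the mean absolute errors work out to roughly 3.3° in azimuth, 1.9° in pitch, and 6.5 cm in distance.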
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
