Remote Sensing
  • Article
  • Open Access

7 May 2021

Motion Estimation Using Region-Level Segmentation and Extended Kalman Filter for Autonomous Driving

1 School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2 School of Physics and Electronic Engineering, Fuyang Normal University, Fuyang 236037, China
3 School of Electrical and Electronic Engineering, Anhui Science and Technology University, Bengbu 233100, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Computer Vision and Image Processing

Abstract

Motion estimation is crucial for predicting where other traffic participants will be after a certain period of time and, accordingly, for planning the route of the ego-vehicle. This paper presents a novel approach to estimating the motion state by using region-level instance segmentation and an extended Kalman filter (EKF). Motion estimation involves three stages: object detection, tracking, and parameter estimation. We first use a region-level segmentation to accurately locate the object region for the latter two stages. The region-level segmentation combines color, temporal (optical flow), and spatial (depth) information as the basis for segmentation by using super-pixels and a Conditional Random Field. Optical flow is then employed to track the feature points within the object area. In the parameter estimation stage, we develop a relative motion model of the ego-vehicle and the object, and accordingly establish an EKF model for point tracking and parameter estimation. The EKF model integrates the ego-motion, optical flow, and disparity to generate optimized motion parameters. During tracking and parameter estimation, we apply an edge-point constraint and a consistency constraint to eliminate outliers among the tracked points, so that the feature points used for tracking lie within the object body and the parameter estimates are refined by inlier points. Experiments have been conducted on the KITTI dataset, and the results demonstrate that our method performs excellently and outperforms other state-of-the-art methods in both object segmentation and parameter estimation.

1. Introduction

Research on autonomous vehicles is in the ascendant [1,2,3]. Autonomous vehicles are cars or trucks that operate without human drivers, using a combination of sensors and software for navigation and control [4]. Autonomous vehicles require not only detecting and locating moving objects but also knowing their motion state relative to the ego-vehicle, i.e., motion estimation [5,6,7]. Motion estimation helps predict where other traffic participants will be after a certain period of time and, accordingly, plan the route of the ego-vehicle. In this work, we propose a novel approach to estimating the motion state for autonomous vehicles by using region-level segmentation and an Extended Kalman Filter (EKF).
Motion estimation involves three stages: object detection, tracking, and estimation of motion parameters including position, velocity, and acceleration in three directions. Accurate object detection is crucial for high-quality motion estimation because the latter two stages rely on the points within the object region; that is, only points exactly within the object region can be used for tracking and parameter estimation. Existing works on motion estimation such as Refs. [8,9,10,11,12,13,14,15] normally generate bounding boxes as object proposals for the latter two stages. One inherent problem of these methods is that the bounding boxes contain substantial background points, as shown in Figure 1. These points are noise and result in unreliable object tracking and incorrect parameter estimation. To address this issue, we adopt two strategies: (1) Instead of bounding boxes, we use segmented object regions as object proposals. We employ the YOLO-v4 detector [16] to generate object bounding boxes and apply a region-level segmentation on them to accurately locate the object contour and determine the points within the objects (Figure 1 shows the results). (2) We impose an edge-point constraint on the feature points and apply the random sample consensus (RANSAC) [17] algorithm to eliminate outliers among the tracked points, so that the points used for tracking lie within the object body and the parameter estimates are refined by inlier points. Through the above processing, we obtain a high-quality point set for tracking and parameter estimation, thereby generating accurate motion estimation.
Figure 1. An illustration of bounding boxes containing object and background points. Three bounding boxes are detected, each containing an object and background points. The object regions are accurately segmented by the blue, red, and green masks generated by our region-level segmentation. The pixels within the masks are used as feature points for tracking and parameter estimation.
Other aspects affecting motion estimation are how to establish the motion model for tracking and how to optimize the parameter estimation. In this work, we use optical flow to track the feature points. We propose a relative motion model of the ego-vehicle and moving objects, and accordingly establish an EKF model for point tracking and parameter estimation. The EKF model takes the ego-motion into consideration and integrates optical flow and disparity to generate an optimized object position and velocity.
In summary, we propose a novel framework for motion estimation by using region-level segmentation and Extended Kalman Filter. The main contributions of the work are:
  • A region-level segmentation is proposed to accurately locate object regions. The proposed method segments the object from a pre-generated candidate region, and refines it by combining color, temporal (optical flow), and spatial (depth) information using super-pixels and a Conditional Random Field.
  • We propose a relative motion model of the ego-vehicle and the object, and accordingly establish an EKF model for point tracking and parameter estimation. The EKF model integrates the ego-motion, optical flow, and disparity to generate optimized motion parameters.
  • We apply an edge-point constraint, a consistency constraint, and the RANSAC algorithm to eliminate outliers among the tracked points, thus ensuring that the feature points used for tracking lie within the object body and the parameter estimates are refined by inlier points.
  • The experimental results demonstrate that our region-level segmentation presents excellent segmentation performance and outperforms the state-of-the-art segmentation methods. The motion estimation experiments confirm the superior performance of our proposed method over the state-of-the-art approaches in terms of the root mean squared error.
The remainder of this paper is organized as follows: Section 2 briefly introduces the relevant works. Section 3 describes the details of the proposed method, including object detection and segmentation, and tracking and parameter estimation. The experiments and results are presented and discussed in Section 4. Section 5 concludes the paper.

3. Method

The framework of the proposed method is shown in Figure 2. The main idea is to accurately determine feature points within the object through instance segmentation and to predict the motion state by tracking the feature points through an EKF. The method includes two stages: (1) object segmentation and (2) tracking and motion estimation.
Figure 2. The framework of the proposed method. P(X, Z) represents the lateral and longitudinal distances of objects in the camera coordinates, and V(V_X, V_Z) denotes the lateral and longitudinal absolute velocities.
In the first stage, we use the YOLO-v4 detector to locate the object region in the form of a bounding box, and then extract the accurate object contour through a region-level segmentation. The output is the set of feature points exactly within the object body.
In the second stage, we impose an edge-point constraint to further refine the feature points. We use optical flow to track the refined feature points. We propose a relative motion model of the ego-vehicle and a moving object, and accordingly establish an EKF model for parameter estimation. We also apply the random sample consensus (RANSAC) algorithm to eliminate outliers among the tracked points. The EKF model integrates the ego-motion, optical flow, and disparity to generate an optimized object position and velocity.

3.1. Object Detection and Region-Level Segmentation

Object detection locates the object region, while segmentation determines the foreground pixels (the object body) within that region. Figure 3 shows the process of object detection and segmentation.
Figure 3. An illustration of object/background confidence maps and segmentation results. (a) Bounding box detected by YOLO-v4. (b) Foreground confidence map. Red represents higher confidence value while blue represents lower value. (c) Background confidence map. (d) Segmentation result generated by GrabCut. (e) Super-pixels generated by Simple Linear Iterative Clustering. (f) Segmentation result after applying CRF.
We employ a YOLO-v4 detector to locate the object region. The details of YOLO-v4 can be found in Reference [16]. The detection result is in the form of a bounding box that contains background, as shown in Figure 3a.
Region-level segmentation consists of three stages: GrabCut, super-pixels, and super-pixel label refinement by a Conditional Random Field. Starting from the bounding box (Figure 3a) detected by YOLO-v4, we apply the GrabCut algorithm to separate the foreground from the background. The GrabCut algorithm, proposed in Ref. [29], is an interactive method that segments images according to texture and boundary information. When using GrabCut, we initially define the interior of the bounding box as foreground and the exterior as background, and accordingly build a pixel-level Gaussian Mixture Model to estimate the texture distribution of the foreground/background. Through an iterative process run until convergence, we obtain the confidence maps of the foreground and background. The results are shown in Figure 3b,c.
Accordingly, GrabCut assigns a label $\gamma_{uv}$ to pixel $(u, v)$ as follows:
$$\gamma_{uv} = \begin{cases} 1, & \text{if } (u,v) \text{ is foreground} \\ 0, & \text{if } (u,v) \text{ is background} \end{cases}$$
The result is shown in Figure 3d, in which the background is marked in black and the foreground in red. This is a pre-segmentation with some significant errors; for example, the license plate in Figure 3d is excluded from the car body.
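For illustration, the following minimal Python sketch shows how this pre-segmentation step can be realized with OpenCV's GrabCut, initialized with a detector bounding box. The variable names, the helper function, and the number of iterations are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the GrabCut pre-segmentation, assuming an 8-bit BGR image and a
# bounding box (x, y, w, h) produced by the detector.
import cv2
import numpy as np

def grabcut_presegment(image_bgr, bbox, iterations=5):
    """Return a binary foreground mask (the label gamma_uv above) for one bounding box."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)   # GMM parameters for the background
    fgd_model = np.zeros((1, 65), dtype=np.float64)   # GMM parameters for the foreground
    x, y, w, h = bbox                                 # interior = foreground, exterior = background
    cv2.grabCut(image_bgr, mask, (x, y, w, h), bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # Pixels labeled as sure or probable foreground become gamma_uv = 1.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
```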
We refine the pre-segmentation by means of the super-pixel idea proposed in Reference [30]. Super-pixels are an over-segmentation formed by grouping pixels based on low-level image properties such as color and brightness. Super-pixels provide a perceptually meaningful tessellation of image content and naturally preserve object boundaries, thereby reducing the number of image primitives for subsequent segmentation. We adopt Simple Linear Iterative Clustering (SLIC) [31] to generate M super-pixels. SLIC is a simple and easy-to-implement algorithm. It transforms the color image to the CIELAB color space, constructs a distance metric based on pixel coordinates and the L/A/B color components, and adopts k-means clustering to efficiently generate the super-pixels. The label $\vartheta_{s_\beta}$ of a super-pixel $s_\beta$ is assigned by Equation (2):
$$\vartheta_{s_\beta} = \begin{cases} 1, & \sum_{(u,v) \in s_\beta} \gamma_{uv} \geq \dfrac{num}{2} \\ 0, & \text{otherwise} \end{cases}$$
where $num$ is the total number of pixels within super-pixel $s_\beta$. The generated super-pixels are shown in Figure 3e, where the white lines delineate the super-pixels. Super-pixels greatly reduce the computational load in the later stages.
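A minimal sketch of super-pixel generation and the majority-vote labeling of Equation (2) is given below, assuming scikit-image's SLIC implementation; the GrabCut mask `gamma` comes from the previous step, and the parameter values are illustrative.

```python
# Generate SLIC super-pixels and assign each one a foreground/background label by the
# majority vote of Equation (2).
import numpy as np
from skimage.segmentation import slic

def label_superpixels(image_rgb, gamma, n_segments=400):
    """Label a super-pixel 1 (foreground) if at least half of its pixels are foreground
    in the GrabCut mask, and 0 otherwise."""
    segments = slic(image_rgb, n_segments=n_segments, compactness=10, start_label=0)
    labels = np.zeros(segments.max() + 1, dtype=np.uint8)
    for s in range(segments.max() + 1):
        inside = gamma[segments == s]                       # gamma_uv values inside super-pixel s
        labels[s] = 1 if inside.sum() >= inside.size / 2 else 0
    return segments, labels
```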
A Conditional Random Field (CRF) [32] is a discriminative probabilistic model often used for pixel labeling. Assuming the output random variables constitute a Markov random field, a CRF is an extension of the maximum entropy Markov model. Since the labels of super-pixels can be regarded as such random variables, we can use a CRF to model the labeling problem. We define the CRF as an undirected graph with super-pixels as nodes. It can be solved through an approximate graph inference algorithm by minimizing an energy function. The energy function generally contains a unary potential and a pairwise potential. The unary potential is related only to the node itself and determines the likelihood of the node being labeled as a class. The pairwise potential describes the interaction between neighboring nodes and is defined as the similarity between them. In this work, we employ the CRF to refine the labels of the super-pixels generated in Figure 3e. Two super-pixels are considered neighbors if they share an edge in image space. Let $s_\beta$ and $s_j$ ($\beta, j = 1, 2, \dots, M$) be neighboring super-pixels; the CRF energy function is defined as
$$E_{seg}(\Theta) = \sum_{s_\beta} u(s_\beta, \vartheta_{s_\beta}) + \sum_{(s_\beta, s_j) \in \varepsilon} p(s_\beta, s_j)$$
where $\varepsilon$ denotes the set of all neighboring super-pixel pairs, $\vartheta_{s_\beta}$ is the initial super-pixel label assigned in Equation (2), and $\Theta$ represents the 1/0 labeling of the super-pixels. The energy function is minimized using the graph cuts algorithm; we refer readers to [33] for a detailed derivation of the minimization algorithm.
The unary potential $u(s_\beta, \vartheta_{s_\beta})$ in Equation (3) measures the cost of labeling $s_\beta$ with $\vartheta_{s_\beta}$:
$$u(s_\beta, \vartheta_{s_\beta}) = \begin{cases} -\log COF_{fg}(s_\beta), & \text{if } \vartheta_{s_\beta} = 1 \\ -\log COF_{bg}(s_\beta), & \text{if } \vartheta_{s_\beta} = 0 \end{cases}$$
where $COF_{fg}(s_\beta)$ denotes the probability that $s_\beta$ belongs to the foreground, computed by averaging the foreground confidence scores (Figure 3b) over all pixels in $s_\beta$, and $COF_{bg}(s_\beta)$ is the probability that $s_\beta$ belongs to the background.
The pairwise potential $p(s_\beta, s_j)$ in Equation (3) describes the interaction between two neighboring super-pixels. It incorporates the pairwise constraint by combining the color similarity, the mean optical flow direction similarity, and the depth similarity between $s_\beta$ and $s_j$, and is defined as
$$\begin{cases} p(s_\beta, s_j) = \lambda \, \mathbb{1}\!\left(\vartheta_{s_\beta} \neq \vartheta_{s_j}\right) \cdot D_{lab}(s_\beta, s_j) \cdot D_{flow}(s_\beta, s_j) \cdot D_{depth}(s_\beta, s_j) \\ D_{lab}(s_\beta, s_j) = 1 / \left(1 + \left\| lab(s_\beta) - lab(s_j) \right\|_2\right) \\ D_{flow}(s_\beta, s_j) = FL(s_\beta) \cdot FL(s_j) / \left(\left\| FL(s_\beta) \right\|_2 \left\| FL(s_j) \right\|_2\right) \\ D_{depth}(s_\beta, s_j) = \sum \sqrt{hist(s_\beta) \times hist(s_j)} \end{cases}$$
where $\lambda$ is the weight used to adjust the pairwise potential in $E_{seg}$, $\mathbb{1}(\cdot)$ is an indicator function that outputs 1 if the input condition is true and 0 otherwise, and $\|\cdot\|_2$ denotes the L2-norm. $D_{lab}(s_\beta, s_j)$ defines the color similarity between $s_\beta$ and $s_j$, where $lab(s_\beta)$ is the average LAB color of $s_\beta$ in the CIELAB color space. $FL(s_\beta)$ is the mean optical flow of $s_\beta$, and $D_{flow}(s_\beta, s_j)$ represents the direction similarity between the mean flows of $s_\beta$ and $s_j$. $D_{depth}(s_\beta, s_j)$ is the depth similarity between $s_\beta$ and $s_j$, measured using the Bhattacharyya distance, where $hist(s_\beta)$ is the normalized depth histogram of $s_\beta$. It can be seen that the pairwise potential integrates color, temporal (optical flow), and spatial (depth) information as segmentation criteria. The final segmentation result is shown in Figure 3f.
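For concreteness, a minimal Python sketch of the pairwise potential defined above is given below. The per-super-pixel statistics (mean LAB color, mean optical flow vector, normalized depth histogram) are assumed to be precomputed, and the Bhattacharyya coefficient is used here as the depth-histogram similarity.

```python
# Compute the pairwise potential between two neighboring super-pixels; all inputs are
# NumPy arrays and the weight `lam` is an illustrative value.
import numpy as np

def pairwise_potential(label_i, label_j, lab_i, lab_j, flow_i, flow_j,
                       hist_i, hist_j, lam=1.0):
    if label_i == label_j:                                   # indicator term: only differing labels are penalized
        return 0.0
    d_lab = 1.0 / (1.0 + np.linalg.norm(lab_i - lab_j))      # color similarity in CIELAB space
    d_flow = np.dot(flow_i, flow_j) / (np.linalg.norm(flow_i) *
                                       np.linalg.norm(flow_j) + 1e-9)  # flow direction similarity
    d_depth = np.sum(np.sqrt(hist_i * hist_j))               # similarity of normalized depth histograms
    return lam * d_lab * d_flow * d_depth
```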

3.2. Tracking and Parameter Estimation

We use optical flow to track the feature points. We establish a relative motion model between the ego-vehicle and the object, taking the camera ego-motion into consideration, and accordingly build an EKF model for point tracking and parameter estimation. The EKF model integrates the ego-motion, optical flow, and disparity to generate an optimized object position and velocity. During tracking, we impose an edge-point constraint to refine the feature points. During parameter estimation, we apply the RANSAC algorithm to eliminate outliers among the tracked points.

3.2.1. The Relative Motion Model of the Ego-Vehicle and the Object

Figure 4 shows the relative motion model between the ego-vehicle and a moving object.
Figure 4. Relative motion model. (a) The camera coordinates. (b) Relative motion between ego-vehicle and moving object.
The ego-vehicle and the object move in the X–Z plane. Assuming that the ego-vehicle moves from position C1 to C2 within a time interval $\Delta t$ with a translational velocity $V^S = (V_X^S, V_Y^S, V_Z^S)^T$ and a rotational velocity $\omega^S$ around the Y-axis, the trajectory can be regarded as the arc C1C2 with a rotation angle $\alpha = \omega^S \Delta t$. The displacement $\Delta L^S = (\Delta X^S, 0, \Delta Z^S)^T$ in the camera coordinates at position C2 is:
$$\Delta L^S = \begin{bmatrix} \Delta X^S \\ 0 \\ \Delta Z^S \end{bmatrix} = \frac{\left\| V^S \right\|_2}{\omega^S} \begin{bmatrix} 1 - \cos\alpha \\ 0 \\ \sin\alpha \end{bmatrix}$$
The object P is located at C3 at time t, and its absolute velocity in the camera coordinates at position C1 is $V_t^O = (V_{Xt}^O, V_{Yt}^O, V_{Zt}^O)^T$. Assuming that the object moves from C3 to C4 with velocity $V_t^O$ within $\Delta t$, the absolute velocity $V_{t+\Delta t}^O$ of P at time $t + \Delta t$ is related to the change of the camera coordinates. Taking the ego-vehicle motion into consideration, the displacement $\Delta L^O$ and the velocity $V_{t+\Delta t}^O$ of P in the camera coordinates at C2 are computed from:
$$\Delta L^O = V_{t+\Delta t}^O \, \Delta t$$
$$V_{t+\Delta t}^O = R(\alpha) \, V_t^O$$
where $R(\alpha)$ is the rotation matrix given by the Rodrigues rotation formula:
$$R(\alpha) = \begin{bmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{bmatrix}$$
Thus, given the coordinates $P_t = (X_t, Y_t, Z_t)^T$ of P in the camera coordinates at C1 at time t, the coordinates $P_{t+\Delta t}$ of P in the camera coordinates at C2 at time $t + \Delta t$ are calculated by:
$$P_{t+\Delta t} = R(\alpha) \, P_t + \Delta L^O + \Delta L^S$$
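The following minimal numerical sketch works through Equations (6)–(10): given the ego translational velocity, yaw rate, and the object's absolute velocity, it predicts the object's position in the camera coordinates at time $t + \Delta t$. The function and variable names are illustrative, and a non-zero yaw rate is assumed.

```python
# Predict an object's position in the camera frame after dt, accounting for ego-motion.
import numpy as np

def rotation_y(alpha):
    """Rotation about the Y-axis by angle alpha (the Rodrigues formula above)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def predict_object_position(P_t, V_obj_t, v_ego, omega_ego, dt):
    alpha = omega_ego * dt                       # rotation angle of the ego-vehicle (omega_ego != 0 assumed)
    R = rotation_y(alpha)
    # Ego displacement along the arc C1C2 (Equation (6)).
    dL_ego = (np.linalg.norm(v_ego) / omega_ego) * np.array([1.0 - np.cos(alpha), 0.0, np.sin(alpha)])
    # Object velocity expressed in the camera frame at C2 (Equation (8)) and its displacement (Equation (7)).
    V_obj_next = R @ V_obj_t
    dL_obj = V_obj_next * dt
    # Object position at t + dt in the camera frame at C2 (Equation (10)).
    return R @ P_t + dL_obj + dL_ego
```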

3.2.2. Design of Kalman Filter

(1) Motion Model
The state vector for P is defined as
$$SV = \left[ X, Y, Z, V_X^O, V_Y^O, V_Z^O \right]^T$$
where $(X, Y, Z)^T$ represents the coordinates of P in the moving camera coordinates, and $(V_X^O, V_Y^O, V_Z^O)^T$ is the absolute velocity of P along the X-, Y-, and Z-axes.
Combining Equations (6)–(8) and (10), the time-discrete motion equation for the state vector $SV$ is given by:
$$SV_k = A \, SV_{k-1} + B_{k-1} + \delta_k$$
$$A = \begin{bmatrix} R(\alpha) & \Delta t \, R(\alpha) \\ 0 & R(\alpha) \end{bmatrix}$$
$$B_{k-1} = \frac{\left\| V_{k-1}^S \right\|_2}{\omega^S} \begin{bmatrix} 1 - \cos\alpha \\ 0 \\ \sin\alpha \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
where k is the time index and the process noise $\delta_k$ is modeled as zero-mean Gaussian white noise.
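As a sketch of the motion model, the block-structured state-transition matrix A and the input term B defined above can be assembled as follows; the function and argument names are illustrative.

```python
# Build the 6x6 state-transition matrix A and the 6-vector input term B of the motion model.
import numpy as np

def transition_matrices(alpha, dt, v_ego_norm, omega_ego):
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])   # rotation about the Y-axis
    A = np.block([[R, dt * R],
                  [np.zeros((3, 3)), R]])                        # position block coupled to velocity block
    B = (v_ego_norm / omega_ego) * np.array([1.0 - c, 0.0, s, 0.0, 0.0, 0.0])  # ego displacement term
    return A, B
```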
(2) Measurement Model
The measurement vector for P is $MV = (u, v, d)^T$, where $(u, v)$ is the image projection of P and $d$ is the disparity. Optical flow is used to track $P_k(u, v)$ at time k to $P_{k+1}(u, v)$ at time k+1, and the corresponding disparities $d_k$ and $d_{k+1}$ are measured from the stereovision.
According to the ideal pinhole camera model, the nonlinear measurement equation can be written as:
$$MV_k = H(SV_k) + \epsilon_k$$
$$H(SV_k) = \begin{bmatrix} u \\ v \\ d \end{bmatrix} = \begin{bmatrix} f_u X / Z + c_u \\ f_v Y / Z + c_v \\ b \, f_u / Z \end{bmatrix}$$
where $\epsilon_k$ is the Gaussian measurement noise, $f_u$ and $f_v$ are the camera focal lengths, $c_u$ and $c_v$ are the camera centre offsets, and $b$ is the camera baseline length. The Jacobian matrix of the measurement equation can be expressed as
$$J = \begin{bmatrix} f_u / Z & 0 & -f_u X / Z^2 & 0 & 0 & 0 \\ 0 & f_v / Z & -f_v Y / Z^2 & 0 & 0 & 0 \\ 0 & 0 & -b \, f_u / Z^2 & 0 & 0 & 0 \end{bmatrix}$$
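A minimal sketch of the pinhole measurement function and its Jacobian is shown below; `f_u`, `f_v`, `c_u`, `c_v`, and `b` are the camera intrinsics and baseline, and the function names are illustrative.

```python
# Project a state onto the measurement (u, v, d) and linearize the projection.
import numpy as np

def measure(state, f_u, f_v, c_u, c_v, b):
    X, Y, Z = state[:3]
    return np.array([f_u * X / Z + c_u,       # image column u
                     f_v * Y / Z + c_v,       # image row v
                     b * f_u / Z])            # stereo disparity d

def measurement_jacobian(state, f_u, f_v, b):
    X, Y, Z = state[:3]
    return np.array([[f_u / Z, 0.0, -f_u * X / Z**2, 0.0, 0.0, 0.0],
                     [0.0, f_v / Z, -f_v * Y / Z**2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, -b * f_u / Z**2, 0.0, 0.0, 0.0]])
```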
(3) Estimation and Update
The location and absolute velocities of P can be obtained by iterating the following estimation and update process. The time update equations are:
$$SV_k^- = A \, SV_{k-1} + B_{k-1}$$
$$P_k^- = A \, P_{k-1} \, A^T + Q_k$$
where $SV_k^-$ is the a priori estimate of the state vector at time k, $SV_{k-1}$ is the a posteriori (optimal) estimate of the state vector at time k−1, $P_k^-$ is the a priori estimate of the error covariance, and $Q_k$ is the covariance of $\delta_k$.
The measurement update equations are
$$G_k = P_k^- J_k^T \left( J_k P_k^- J_k^T + W_k \right)^{-1}$$
$$SV_k = SV_k^- + G_k \left( MV_k - H(SV_k^-) \right)$$
$$P_k = \left( I - G_k J_k \right) P_k^-$$
where $G_k$ is the Kalman gain, $W_k$ is the covariance of $\epsilon_k$, $I$ is the identity matrix, $SV_k$ is the a posteriori (optimal) estimate of the state vector at time k, and $P_k$ is the a posteriori estimate of the error covariance.
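The sketch below combines the above time and measurement updates into one EKF iteration, reusing the `measure` and `measurement_jacobian` functions sketched earlier; A and B come from the motion model, and Q and W are the process and measurement noise covariances. All names are illustrative, not the authors' implementation.

```python
# One predict/update cycle of the EKF for a single tracked point.
import numpy as np

def ekf_step(sv_post, P_post, A, B, Q, W, mv, f_u, f_v, c_u, c_v, b):
    # Time update: a priori state and error covariance.
    sv_prior = A @ sv_post + B
    P_prior = A @ P_post @ A.T + Q
    # Measurement update: Kalman gain, a posteriori state and error covariance.
    J = measurement_jacobian(sv_prior, f_u, f_v, b)
    G = P_prior @ J.T @ np.linalg.inv(J @ P_prior @ J.T + W)
    sv_post_new = sv_prior + G @ (mv - measure(sv_prior, f_u, f_v, c_u, c_v, b))
    P_post_new = (np.eye(sv_post.shape[0]) - G @ J) @ P_prior
    return sv_post_new, P_post_new
```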

3.2.3. Feature-Point Filtering

The tracking discussed in the above EKF is for a single object point. As described in Section 3.1, each segmented object consists of a cluster of points, i.e., a set of foreground pixels, and the motion state of an object is taken as the average over its tracked points. For the sake of tracking reliability and computational efficiency, it is essential to select reliable feature points for tracking and estimation; feature-point filtering is therefore crucial.
Since edge points have strong textural features and facilitate optical flow calculation, we employ the Canny operator [34] to extract the edge points as feature points. During tracking, we impose an edge-point constraint on the tracking results: a tracked point must still be an edge point, otherwise it is excluded.
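A minimal sketch of this edge-point constraint is given below: feature points are Canny edge pixels inside the segmented object mask, and a tracked point is kept only if it still lands on an edge pixel in the current frame. The thresholds and the small tolerance radius are assumptions.

```python
# Extract edge feature points inside the object mask and apply the edge-point constraint
# to points tracked into the current frame.
import cv2
import numpy as np

def extract_edge_points(gray, object_mask, low=100, high=200):
    edges = cv2.Canny(gray, low, high)
    edges[object_mask == 0] = 0                      # keep only edges inside the object body
    ys, xs = np.nonzero(edges)
    return np.stack([xs, ys], axis=1)                # (N, 2) array of (x, y) feature points

def apply_edge_constraint(tracked_pts, edges_current, radius=1):
    kept = []
    h, w = edges_current.shape
    for x, y in np.round(tracked_pts).astype(int):
        x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        if edges_current[y0:y1, x0:x1].any():        # still an edge point (within a small tolerance)
            kept.append((x, y))
    return np.array(kept)
```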
Furthermore, we enhance estimation accuracy by applying the RANSAC algorithm [17] to eliminate outliers among the tracked points. RANSAC is a statistics-based hypothesis-verification method that iteratively identifies inlier data within noisy data. In each iteration, a minimal number of samples is randomly selected to construct a consistency hypothesis, and the remaining samples are checked for conformity with the hypothesis; samples that conform are taken as inliers. These steps are repeated to obtain the sample set with the largest number of inliers, i.e., the maximum consensus set, which is used to calculate the motion parameters.
We impose a consistency constraint on the estimation results, that is, the estimates for feature points belonging to the same object should be consistent. In this work, the longitudinal distance and velocity and the lateral distance and velocity are used as target parameters for iteratively selecting the inlier set. The implementation flow of the RANSAC filtering is illustrated in Algorithm 1.
Algorithm 1. Implementation flow of the RANSAC filtering (example: longitudinal distance).
Input: a set of feature points FR; the maximum number of iterations I_max; the consistency threshold th, i.e., the threshold on the deviation between a point's longitudinal distance and the hypothesis average.
Output: the maximum consensus set FR_max; the object longitudinal distance ζ_final.
i = 0, N_max = 0
while i < I_max do
        1. Hypothesis generation
        Randomly select m feature points from FR as the minimal consensus set
        Calculate the average longitudinal distance ζ_Z of the minimal consensus set
        2. Verification
        Calculate the deviation of each point in FR, i.e., the difference between its longitudinal distance and ζ_Z
        Determine the set FR_i of points whose deviations are less than th
        Count the number of points in FR_i as N
        if N > N_max then
                FR_max = FR_i, N_max = N
        end if
        i = i + 1
end while
Calculate the average longitudinal distance over FR_max as ζ_final
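A minimal Python sketch of Algorithm 1 for the longitudinal distance is given below; `points_z` holds the per-feature-point longitudinal distances, and the sample size m, threshold th, and iteration count are illustrative values.

```python
# RANSAC consistency filtering over per-point longitudinal distance estimates.
import numpy as np

def ransac_consistency(points_z, m=5, th=0.5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    points_z = np.asarray(points_z, dtype=float)
    best_idx = np.array([], dtype=int)
    for _ in range(max_iter):
        # 1. Hypothesis generation: average distance over a random minimal set.
        sample = rng.choice(len(points_z), size=min(m, len(points_z)), replace=False)
        zeta = points_z[sample].mean()
        # 2. Verification: keep points whose deviation from the hypothesis is below th.
        inliers = np.nonzero(np.abs(points_z - zeta) < th)[0]
        if len(inliers) > len(best_idx):
            best_idx = inliers
    # The object's longitudinal distance is the average over the maximum consensus set.
    return best_idx, points_z[best_idx].mean()
```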

4. Experiments

Experiments have been conducted on image sequences (Road and City) from the public KITTI dataset [35]. The binocular camera settings are: baseline length 0.54 m, mounting height 1.65 m, tilt angle to the ground 0°, and rectified image resolution 375 × 1242. KITTI provides the ground truth of the ego-vehicle motion and the motion states of moving objects. The experiments were run on a workstation with an Intel Xeon Silver 4110 4-core processor, 16 GB RAM, and an NVIDIA GeForce GTX 1080 Ti graphics processor with 11 GB of video memory.
We use HD3-flow [36] to predict optical flow and PSMNet [37] to generate the disparity maps. We retrained HD3-flow and PSMNet from their original weights on the KITTI dataset.

4.1. Segmentation Results

We compare our segmentation method with two state-of-the-art methods, PSPNet [25] and YOLACT++ [28]. The results of the three methods, compared with the ground truth, are shown in Figure 5. The fourth row shows the results obtained by our method using only color information, without optical flow and disparity, referred to as "Our method1".
Figure 5. The segmentation results of the three methods and the ground truth in two traffic scenarios: (1) Road scene; (2) City scene. "Our method1" denotes our method using only color information, without optical flow and disparity.
In the Road scene, our method accurately segments Obj. 1, 2, and 3. Our method1 fails to recognize the license plates and lights of Obj. 2 and 3 as part of the car bodies. PSPNet wrongly merges Obj. 2 and 3 into one object, while YOLACT++ wrongly merges a distant building with Obj. 2.
In the City scene, our method also achieves the best result in both frame 4 (no-occlusion case) and frame 9 (occlusion case). In particular, our method is able to accurately distinguish Obj. 4 from the traffic light poles in frame 9. PSPNet shows significant errors in both frames, while YOLACT++ fails to separate Obj. 4 from the traffic poles. Our method1 does not correctly segment the front windshield of Obj. 4 in frame 9, and the wheels are excluded from the car body in frames 4 and 9.
We use four metrics to quantitatively evaluate the segmentation performance.
Mean Intersection over Union (MIoU) [38]: It computes a ratio between the intersection and the union of the ground truth and predicted segmentation.
$$MIoU = \frac{1}{l} \sum_{l_1=0}^{1} \frac{CN_{l_1 l_1}}{\sum_{l_2=0}^{1} CN_{l_1 l_2} + \sum_{l_2=0}^{1} CN_{l_2 l_1} - CN_{l_1 l_1}}$$
where $l$ is the number of classes, in this case $l = 2$ (foreground/background), and $CN_{l_1 l_2}$ is the number of pixels of class $l_1$ inferred to belong to class $l_2$. When calculating the MIoU, $l_1$ and $l_2$ take the value 1 (foreground) or 0 (background) to count the positive and negative pixels.
The False Positive Rate (FPR) and the False Negative Rate (FNR) are computed by
$$FPR = \frac{FP}{FP + TN}$$
$$FNR = \frac{FN}{FN + TP}$$
where True Positive (TP) and False Positive (FP) indicate the correctly and incorrectly segmented positive (foreground) pixels, while the True Negative (TN) and False Negative (FN) indicate the correctly and incorrectly segmented negative (background) pixels.
Overall error (Ov. err.) is the percentage of wrongly labelled pixels.
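A minimal sketch of the four metrics, assuming binary NumPy masks (1 = foreground, 0 = background), is shown below; with two classes the MIoU reduces to the average of the foreground and background IoU.

```python
# Compute MIoU, FPR, FNR, and overall error from ground-truth and predicted binary masks.
import numpy as np

def segmentation_metrics(gt, pred):
    tp = np.sum((gt == 1) & (pred == 1))
    fp = np.sum((gt == 0) & (pred == 1))
    tn = np.sum((gt == 0) & (pred == 0))
    fn = np.sum((gt == 1) & (pred == 0))
    iou_fg = tp / (tp + fp + fn)                 # IoU of the foreground class
    iou_bg = tn / (tn + fp + fn)                 # IoU of the background class
    miou = (iou_fg + iou_bg) / 2.0               # mean over the two classes
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    ov_err = (fp + fn) / gt.size                 # fraction of wrongly labelled pixels
    return miou, fpr, fnr, ov_err
```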
Since KITTI does not provide ground truth for instance segmentation, we manually labeled 411 images from the Road and City sequences. We conducted experiments on these images, and the average values of the metrics are listed in Table 1. Our method achieves the best MIoU score and the lowest FPR, FNR, and Ov. err., and outperforms the other methods.
Table 1. Comparison of segmentation performance of three methods.
The reasons for the superior performance of our method are: (1) our method segments the object from a candidate region (bounding box) pre-generated by the YOLO-v4 detector rather than from the whole image, which eliminates irrelevant information and simplifies segmentation; (2) our method combines color, temporal (optical flow), and spatial (depth) information as the basis for segmentation; and (3) super-pixels naturally preserve object boundaries and are computationally efficient to process.

4.2. Results of Feature-Point Filtering

As described in Section 3.2.3, the edge points within the object point cluster are used as feature points for tracking, and the edge-point constraint and consistency constraint are applied to filter them. Taking a segmented object as an example, the filtering process is shown in Figure 6. Figure 6a shows the point cluster obtained from our region-level segmentation, and Figure 6b shows the edge points extracted by the Canny operator, which are taken as feature points. The yellow points in Figure 6c are the feature points of Figure 6b (previous frame) tracked into the current frame, while the white points are the edge points of the current frame. Some of the yellow points do not overlap the white points and should be eliminated. The blue points in Figure 6d are the result of excluding the non-overlapping points, i.e., the points satisfying the edge-point constraint. The result of applying the consistency constraint to Figure 6d is shown in Figure 6e; the red points are the feature points with consistent distances and velocities selected by RANSAC, i.e., the maximum consensus set. The arrows in Figure 6f represent the optical flows of the valid feature points. It can be seen that the optical flows are identical, indicating a valid feature-point selection.
Figure 6. The processing of the feature-point filtering. (a) Object point cluster; (b) Edge points; (c) Tracked feature points in current frame; (d) Result of applying edge point constraint; (e) Result of applying consistency constraint; (f) Optical flow of valid feature points.

4.3. Motion Estimation Results and Analysis

4.3.1. Motion Estimation Results

Table 2 lists the estimates and errors of the position and absolute velocity of the objects in Figure 5. It can be seen that our method produces small errors. In the Road scene, the objects show little variation in their lateral positions and mainly move longitudinally. The maximum absolute errors of the objects' longitudinal position and velocity estimates are 0.4 m (Obj. 3, corresponding to the ground truth of 37.5 m) and 0.6 m/s (Obj. 3, corresponding to −13.0 m/s), respectively. In the City scene, the objects mainly move laterally from left to right. The maximum absolute errors of the objects' lateral position and velocity estimates are −0.2 m (Obj. 4 in frame 4, corresponding to −9.9 m) and −0.3 m/s (Obj. 5 in frame 4, corresponding to 10.8 m/s), respectively.
Table 2. Position and absolute velocity of objects in Figure 5 and their estimate errors.
Figure 7 shows the results of object segmentation and motion estimation in three frames of the Road scene sequence. P(X, Z) indicates the lateral and longitudinal distances of objects in the camera coordinates, while V(V_X, V_Z) denotes the lateral and longitudinal absolute velocities. From frame 210 to frame 291, the red car moving in the same direction as the ego-vehicle is tracked. At frame 210, it is 29.8 m away from the ego-vehicle with a longitudinal velocity of 13.8 m/s. In the middle frame, it has moved farther away, with a distance of 33.9 m and a velocity of 15.2 m/s. At frame 291, it is getting closer again, with a distance of 32.3 m and a velocity of 13.7 m/s. At the same time, the other vehicles on the road (shown by the blue, green, brown, and purple masks) are also segmented, tracked, and their motion states predicted.
Figure 7. Object segmentation and motion estimation in three frames of the Road scene.

4.3.2. Evaluation and Comparison

We tested our method against the ground truth over a sequence of images, evaluating two configurations: (1) the method with feature-point filtering (w Ft.Pts.F.) and (2) the method without feature-point filtering (w/o Ft.Pts.F.). Figure 8 shows the variations of the lateral distance and velocity of Obj. 4 from frame 4 to 23 in the City scene. The object moves almost uniformly from left to right: the lateral distance becomes progressively larger while the lateral absolute velocity is approximately constant. Figure 9 shows the variations of the longitudinal distance and velocity of Obj. 2 from frame 4 to 294 in the Road scene; it moves in the same direction as the ego-vehicle. It can be seen that the curves of the w Ft.Pts.F. method are closer to the ground truth and smoother than those of the w/o Ft.Pts.F. method, indicating that the performance of our method is improved by the feature-point filtering.
Figure 8. Variations of lateral distance and velocity of Obj. 4 over frames 4–23 in the City scene. (a) The lateral distance; (b) The lateral absolute velocity.
Figure 9. Variations of longitudinal distance and velocity of Obj. 2 over frames 4–294 in the Road scene. (a) The longitudinal distance; (b) The longitudinal absolute velocity.
There is no uniform evaluation metric for object motion estimation. One of the commonly used metrics is the root mean squared error (RMSE) over a sequence of images. The RMSE is defined as
$$RMSE_* = \sqrt{\frac{\sum_{c=1}^{NF} \left( mt_c^* - gt_c^* \right)^2}{NF}}$$
where NF is the number of frames in a sequence in which at least one object is being tracked, $mt_c^*$ is the estimate, and $gt_c^*$ is the ground truth. The superscript * denotes the parameter being evaluated: the lateral distance X, the lateral velocity $V_X$, the longitudinal distance Z, or the longitudinal velocity $V_Z$. For example, $RMSE_X$ is the root mean squared error of the lateral distance X. We compared our method with three other state-of-the-art works [13,21,24] that also use the RMSE as the evaluation metric. Table 3 lists the comparison results.
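For completeness, a minimal sketch of this metric is given below; `estimates` and `ground_truth` are per-frame values of one parameter (e.g., the longitudinal distance Z) over the tracked frames, and the function name is illustrative.

```python
# Root mean squared error of one motion parameter over a tracked sequence.
import numpy as np

def rmse(estimates, ground_truth):
    estimates = np.asarray(estimates, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean((estimates - ground_truth) ** 2))
```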
Table 3. The R M S E comparisons of our method with the state-of-the-art methods.
As can be seen in Table 3, our proposed method with feature-point filtering brings a significant improvement over the other methods, particularly in the RMSE of the distances. It can also be seen that the performance of our method is further improved by the feature-point filtering.
To evaluate the effect of each component in the proposed method on motion estimation, we have conducted an ablation study on different versions of the method. The results are summarized in Table 4.
Table 4. Quantitative comparison of different versions of our method in the ablation study. RLS indicates our region-level segmentation, Ft.Pts.F. denotes the feature-point filtering, EKF model represents the Kalman filter designed in Section 3.2.2, w/o means without, and w is short for with. A tick mark indicates that the corresponding component is included.
Comparing the other rows with the first row, which is the standard version, shows how each component contributes to the RMSE values. Comparing the second row (using the bounding box rather than region-level segmentation) with the first row demonstrates that the proposed region-level segmentation significantly improves the results. Comparing the third row with the first row demonstrates the effect of the feature-point filtering, and comparing the fourth row with the first row demonstrates that the EKF model is effective.

5. Conclusions

In this work, we adopt three strategies to achieve accurate and robust motion estimation for autonomous driving. (1) Instead of bounding boxes, we use segmented object regions as object proposals for tracking and parameter estimation. We propose a region-level segmentation to accurately locate the object contour and determine the points within the objects. (2) We impose an edge-point constraint on the feature points and apply the random sample consensus algorithm to eliminate outliers among the tracked points, so that the points used for tracking lie within the object body and the parameter estimates are refined by inlier points. (3) We develop a relative motion model of the ego-vehicle and the object, and accordingly establish an EKF model for point tracking and parameter estimation. The EKF model takes the ego-motion into consideration and integrates the ego-motion, optical flow, and disparity to generate optimized motion parameters. Substantial experiments have been conducted on the KITTI dataset, and the results demonstrate that our region-level segmentation performs excellently and outperforms the state-of-the-art segmentation methods. For motion estimation, our method achieves superior RMSE performance compared to the other state-of-the-art methods.

Author Contributions

Conceptualization, H.W. and Y.H.; methodology, H.W. and Y.H.; software, H.W.; validation, H.W. and Y.H.; formal analysis, H.W.; investigation, H.W.; writing—original draft preparation, H.W. and Y.H.; writing—review and editing, H.W., Y.H., F.H., B.Z., Z.G., and R.Z.; project administration, Y.H.; funding acquisition, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shanghai Nature Science Foundation of Shanghai Science and Technology Commission, China, grant number 20ZR1437900, and National Nature Science Foundation of China, grant number 61374197.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Trubia, S.; Curto, S.; Severino, A.; Arena, F.; Zuccalà, Y. Autonomous vehicles effects on public transport systems. AIP Conf. Proc. 2021, 2343, 110014. [Google Scholar] [CrossRef]
  2. Curto, S.; Severino, A.; Trubia, S.; Arena, F.; Puleo, L. The effects of autonomous vehicles on safety. AIP Conf. Proc. 2021, 2343, 110013. [Google Scholar] [CrossRef]
  3. Arena, F.; Ticali, D. The development of autonomous driving vehicles in tomorrow’s smart cities mobility. AIP Conf. Proc. 2018, 2040, 140007. [Google Scholar] [CrossRef]
  4. Arena, F.; Pau, G.; Severino, A. An Overview on the Current Status and Future Perspectives of Smart Cars. Infrastructures 2020, 5, 53. [Google Scholar] [CrossRef]
  5. Brummelen, J.V.; O’Brien, M.; Gruyer, D.; Najjaran, H. Autonomous vehicle perception: The technology of today and tomorrow. Transp. Res. C Emerg. Technol. 2018, 89, 384–406. [Google Scholar] [CrossRef]
  6. Bersani, M.; Mentasti, S.; Dahal, P.; Arrigoni, S. An integrated algorithm for ego-vehicle and obstacles state estimation for autonomous driving. Robot. Auton. Syst. 2021, 139, 103662. [Google Scholar] [CrossRef]
  7. Geng, K.; Dong, G.; Yin, G.; Hu, J. Deep Dual-Modal Traffic Objects Instance Segmentation Method Using Camera and LIDAR Data for Autonomous Driving. Remote Sens. 2020, 12, 3274. [Google Scholar] [CrossRef]
  8. Jain, D.K.; Jain, R.; Cai, L.; Gupta, M.; Upadhyay, Y. Relative Vehicle Velocity Estimation Using Monocular Video Stream. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2020), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
  9. Kuramoto, A.; Aldibaja, M.A.; Yanase, R.; Kameyama, J.; Yoneda, K.; Suganuma, N. Mono-Camera based 3D Object Tracking Strategy for Autonomous Vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 459–464. [Google Scholar] [CrossRef]
  10. Lim, Y.-C.; Lee, M.; Lee, C.-H.; Kwon, S.; Lee, J.-H. Improvement of stereo vision-based position and velocity estimation and tracking using a stripe-based disparity estimation and inverse perspective map-based extended Kalman filter. Opt. Lasers Eng. 2010, 48, 859–868. [Google Scholar] [CrossRef]
  11. Liu, Z.; Lu, D.; Qian, W.; Ren, K.; Zhang, J.; Xu, L. Vision-based inter-vehicle distance estimation for driver alarm system. IET Intell. Transp. Syst. 2019, 13, 927–932. [Google Scholar] [CrossRef]
  12. Vatavu, A.; Danescu, R.; Nedevschi, S. Stereovision-Based Multiple Object Tracking in Traffic Scenarios Using Free-Form Obstacle Delimiters and Particle Filters. IEEE Trans. Intell. Transp. Syst. 2015, 16, 498–511. [Google Scholar] [CrossRef]
  13. Hayakawa, J.; Dariush, B. Ego-motion and Surrounding Vehicle State Estimation Using a Monocular Camera. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2550–2556. [Google Scholar] [CrossRef]
  14. Min, Q.; Huang, Y. Motion detection using binocular image flow in dynamic scenes. EURASIP J. Adv. Signal Process. 2016, 2016, 49. [Google Scholar] [CrossRef]
  15. Cao, Z.; Kar, A.; Häne, C.; Malik, J. Learning Independent Object Motion From Unlabelled Stereoscopic Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 5587–5596. [Google Scholar] [CrossRef]
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Available online: https://arxiv.org/abs/2004.10934v1 (accessed on 23 April 2020).
  17. Raguram, R.; Chum, O.; Pollefeys, M.; Matas, J.; Frahm, J. USAC: A Universal Framework for Random Sample Consensus. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2022–2038. [Google Scholar] [CrossRef]
  18. Garcia, F.; Martin, D.; de la Escalera, A.; Armingol, J.M. Sensor Fusion Methodology for Vehicle Detection. IEEE Intell. Transp. Syst. Mag. 2017, 9, 123–133. [Google Scholar] [CrossRef]
  19. Barth, A.; Franke, U. Estimating the Driving State of Oncoming Vehicles From a Moving Platform Using Stereo Vision. IEEE Trans. Intell. Transp. Syst. 2009, 10, 560–571. [Google Scholar] [CrossRef]
  20. He, H.; Li, Y.; Tan, J. Relative motion estimation using visual–inertial optical flow. Auton. Rob. 2018, 42, 615–629. [Google Scholar] [CrossRef]
  21. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. Robust Ego and Object 6-DoF Motion Estimation and Tracking. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020), Las Vegas, USA, 24–29 October 2020; pp. 5017–5023. [Google Scholar]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  23. Kim, K.; Choi, W.; Koh, Y.J.; Jeong, S.; Kim, C. Instance-Level Future Motion Estimation in a Single Image Based on Ordinal Regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 27 October–2 November 2019; pp. 273–282. [Google Scholar] [CrossRef]
  24. Song, Z.; Lu, J.; Zhang, T.; Li, H. End-to-end Learning for Inter-Vehicle Distance and Relative Velocity Estimation in ADAS with a Monocular Camera. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2020), Paris, France, 31 May–31 August 2020; pp. 11081–11087. [Google Scholar] [CrossRef]
  25. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  26. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
  27. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar] [CrossRef]
  28. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT++: Better Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689. [Google Scholar]
  29. Rother, C.; Kolmogorov, V.; Blake, A. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graphics 2004, 39, 309–314. [Google Scholar] [CrossRef]
  30. Jampani, V.; Sun, D.; Liu, M.-Y.; Yang, M.-H.; Kautz, J. Superpixel Sampling Networks. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 352–368. [Google Scholar] [CrossRef]
  31. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef]
  32. Wei, L.; Yu, M.; Zhong, Y.; Zhao, J.; Liang, Y.; Hu, X. Spatial–Spectral Fusion Based on Conditional Random Fields for the Fine Classification of Crops in UAV-Borne Hyperspectral Remote Sensing Imagery. Remote Sens. 2019, 11, 780. [Google Scholar] [CrossRef]
  33. Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239. [Google Scholar] [CrossRef]
  34. Yuan, W.; Zhang, W.; Lai, Z.; Zhang, J. Extraction of Yardang Characteristics Using Object-Based Image Analysis and Canny Edge Detection Methods. Remote Sens. 2020, 12, 726. [Google Scholar] [CrossRef]
  35. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  36. Yin, Z.; Darrell, T.; Yu, F. Hierarchical Discrete Distribution Decomposition for Match Density Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 6037–6046. [Google Scholar] [CrossRef]
  37. Chang, J.; Chen, Y. Pyramid Stereo Matching Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar] [CrossRef]
  38. Alberto, G.-G.; Sergio, O.-E.; Sergiu, O.; Víctor, V.M.; Pablo, M.G.; Jose, G.-R. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
