Moving Object Detection Based on Optical Flow Estimation and a Gaussian Mixture Model for Advanced Driver Assistance Systems

Most approaches for moving object detection (MOD) based on computer vision are limited to stationary camera environments. In advanced driver assistance systems (ADAS), however, ego-motion is added to image frames owing to the use of a moving camera. This results in mixed motion in the image frames and makes it difficult to classify target objects and background. In this paper, we propose an efficient MOD algorithm that can cope with moving camera environments. In addition, we present a hardware design and implementation results for the real-time processing of the proposed algorithm. The proposed moving object detector was designed using a hardware description language (HDL) and its real-time performance was evaluated using an FPGA-based test system. Experimental results demonstrate that our design achieves better detection performance than existing MOD systems. The proposed moving object detector was implemented with 13.2K logic slices, 104 DSP48s, and 163 BRAM and can support real-time processing of 30 fps at an operating frequency of 200 MHz.


Introduction
Advanced driver assistance systems (ADAS) represent the most popular field in the automotive industry and have become a key technology for modern vehicle safety and driving comfort [1,2]. The most commonly used ADAS techniques include adaptive cruise control, collision warning and lane change assistance. Collision warning systems are one of the major applications of ADAS and their task is to inform drivers of obstacles around the vehicle by giving visual, aural, or tactile feedback [3,4]. Reliable moving object detection (MOD) technology is an essential part of collision warning systems and various sensor-based techniques have been proposed, such as vision-, lidar- and radar-based techniques [5,6]. Since vision-based MOD technology is relatively more intuitive and cheaper than active sensor techniques, such as radar and lidar, many vision-based algorithms have been proposed.
However, several vision-based MOD algorithms assume that the input image frames are captured by a stationary camera, because detection accuracy degrades for moving cameras [8][9][10][11][12][13][14]. Although methods for generating background models using smart cameras have also been proposed, they are limited to stationary camera environments [15][16][17]. A moving camera mixes the motion of the background with that of the foreground objects, and without compensation the two are hard to separate. Using a moving camera is inevitable in vehicle environments and thus an efficient algorithm that can cope with this fact is needed.
For this reason, several algorithms that allow for MOD in the image frames obtained by moving cameras have recently been proposed [18][19][20][21][22][23][24][25][26][27][28][29]. Some of these methods employ advanced statistical models or outlier detection techniques to estimate the background [18][19][20][21][22][23][24]. In Reference [22], background subtraction based on a 2.5D background model was proposed, which could be used to successfully detect moving objects in complex scenes. In Reference [23], a fast and effective MOD algorithm based on global motion compensation and adaptive background modelling was presented, which supports real-time object detection on a moving camera. The method proposed in Reference [24] introduces a novel background modelling approach based on dynamic reverse analysis (DRA). This approach can handle illumination variations, occlusions and camera instability. A comprehensive overview of the recent efforts made for dealing with illumination variations and occlusions can be found in Reference [25]. Recently, algorithms that rely on deep neural networks (DNNs) have also been proposed [26][27][28]. In the method proposed in Reference [28], a single CNN is trained using stacked depth-wise image-background pairs, and its output is enhanced via post processing. However, these algorithms may not be suitable for ADAS applications with limited power consumption and available area because of the associated high computational complexity and massive memory requirements. In addition, algorithms based on learning methods cannot deal with unexpected situations because the image frames used for learning are restricted to specific cases [29]. In particular, the use of moving cameras results in a wide variety of possible situations, which seriously degrades MOD performance.
On the other hand, a technique for compensating the camera motion, referred to as ego-motion, without relying on learning methods has been proposed [29]. The MOD algorithm proposed in Reference [29] estimates the flow vector of each pixel via a Lucas-Kanade (LK)-based optical flow estimation (OFE) algorithm and analyzes the histogram of the flow vectors to estimate the ego-motion. Then, the background model is compensated using the estimated ego-motion. The compensated background model is used in the MOD process to separate out the mixed motion between the background and foreground objects. This algorithm can prevent many false positives and shows good precision in moving camera environments. However, its recall is degraded. Therefore, its applications are limited to backup collision intervention (BCI), which consists of detecting obstacles behind the vehicle, because such tasks involve relatively slow camera motion, as explained in Reference [29]. This drawback arises from the use of LK-based OFE for ego-motion estimation. LK-based OFE is a local method that estimates flow vectors under the assumption that all pixels within a local region share the same motion. A limitation of this method is that it cannot find the correct flow vectors in regions where the brightness pattern is uniform. Therefore, inaccurate flow vectors estimated in the background regions with uniform brightness patterns have an adverse effect on ego-motion estimation.
In order to overcome the problems associated with the use of local methods, such as LK-based OFE, Horn and Schunck (HS) proposed a global method that defines an energy function over an entire image frame and estimates the flow vectors by minimizing this energy function [30]. However, this method cannot cope with sudden changes in brightness and various algorithms have been proposed to solve this problem [31][32][33]. Among them, the Brox algorithm shows robustness to changes in brightness by extending the conventional OFE brightness constancy assumption to a gradient constancy assumption. It also shows higher accuracy than other algorithms in vehicle environments.
In this paper, we propose an efficient algorithm for finding moving objects after more precisely estimating and compensating ego-motion using the Brox OFE algorithm. In addition, we present a hardware structure design and its results for real-time processing tasks. The remainder of this paper is organized as follows: Section 2 explains the Brox OFE algorithm and the Gaussian mixture model (GMM). Section 3 presents the proposed MOD algorithm, and Section 4 describes the hardware architecture of the proposed moving object detector. Section 5 presents the results for an FPGA implementation of the proposed moving object detector and performance evaluation results. Finally, Section 6 concludes the paper.

Brox Optical Flow Estimation
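Let $\mathbf{x} = (x, y, t)^T$ and $\mathbf{w} = (u, v, 1)^T$, where $\mathbf{x}$ denotes the pixel position at time $t$ and $\mathbf{w}$ represents the estimated flow vector. Then, global deviations from the grey value constancy assumption and the gradient constancy assumption can be measured by the energy

$$E_{Data}(u, v) = \int_{\Omega} \psi\left(\left|I(\mathbf{x}+\mathbf{w}) - I(\mathbf{x})\right|^2 + \gamma \left|\nabla I(\mathbf{x}+\mathbf{w}) - \nabla I(\mathbf{x})\right|^2\right) d\mathbf{x}, \tag{1}$$

where $\gamma$ is a weight between both assumptions. An increasing concave function $\psi(s^2) = \sqrt{s^2 + \epsilon^2}$ is applied, leading to a robust energy estimation [31], where $\epsilon$ is a small positive constant that is typically set to 0.001. Smoothness describes the model's assumption of a piecewise smooth flow field and can be expressed as follows:

$$E_{Smooth}(u, v) = \int_{\Omega} \psi\left(\left|\nabla_3 u\right|^2 + \left|\nabla_3 v\right|^2\right) d\mathbf{x}. \tag{2}$$

The spatio-temporal gradient $\nabla_3 = (\partial_x, \partial_y, \partial_t)^T$ indicates that a spatio-temporal smoothness assumption is involved. The total energy is the weighted sum of (1) and (2), and it can be written as

$$E(u, v) = E_{Data} + \alpha E_{Smooth}, \tag{3}$$

where $\alpha$ is a regularization parameter. The value of $(u, v)$ that minimizes the total energy is the optimal flow vector. The Euler-Lagrange equation is applied to (3) to find the optimal solution. To linearize the equation, Brox applied a numerical approximation. A multi-scale approach, also called a pyramid, and an inner iteration loop are applied to suppress the non-linearity of the remaining $\psi'$ term [31]. Then, the final optimal equation is converted into matrix form for all the pixels within an image frame as follows:

$$\left(\psi'\right)^{k,l}_{Data} \cdot \begin{pmatrix} I^k_x I^k_x + \gamma\left(I^k_{xx} I^k_{xx} + I^k_{xy} I^k_{xy}\right) & I^k_x I^k_y + \gamma\left(I^k_{xx} I^k_{xy} + I^k_{xy} I^k_{yy}\right) \\ I^k_y I^k_x + \gamma\left(I^k_{yy} I^k_{xy} + I^k_{xy} I^k_{xx}\right) & I^k_y I^k_y + \gamma\left(I^k_{yy} I^k_{yy} + I^k_{xy} I^k_{xy}\right) \end{pmatrix} \cdot \begin{pmatrix} du^{k,l+1} \\ dv^{k,l+1} \end{pmatrix} = -\left(\psi'\right)^{k,l}_{Data} \cdot \begin{pmatrix} I^k_x I^k_t + \gamma\left(I^k_{xx} I^k_{xt} + I^k_{xy} I^k_{yt}\right) \\ I^k_y I^k_t + \gamma\left(I^k_{yy} I^k_{yt} + I^k_{xy} I^k_{xt}\right) \end{pmatrix} \tag{4}$$

where $I_x$, $I_y$, $I_t$, $I_{xx}$, $I_{xy}$, $I_{yy}$, $I_{xt}$, and $I_{yt}$ represent the first- and second-order derivatives of the image in each direction; $k$ is the pyramid loop; and $l$ indicates the iteration loop for linearization in each pyramid loop [31].
The terms $(\psi')_{Data}$ and $(\psi')_{Smooth}$ are obtained as follows, with $(\psi')_{Smooth}$ given in the form defined in [31]:

$$\left(\psi'\right)^{k,l}_{Data} = \psi'\left(\left(I^k_t + I^k_x du^{k,l} + I^k_y dv^{k,l}\right)^2 + \gamma\left(\left(I^k_{xt} + I^k_{xx} du^{k,l} + I^k_{xy} dv^{k,l}\right)^2 + \left(I^k_{yt} + I^k_{xy} du^{k,l} + I^k_{yy} dv^{k,l}\right)^2\right)\right), \tag{5}$$

$$\left(\psi'\right)^{k,l}_{Smooth} = \psi'\left(\left|\nabla_3\left(u^k + du^{k,l}\right)\right|^2 + \left|\nabla_3\left(v^k + dv^{k,l}\right)\right|^2\right). \tag{6}$$

If $du^{k,l+1}$ and $dv^{k,l+1}$ on the left-hand side of (4) are denoted as vector $\mathbf{w}$, the remaining term as matrix $A$, and the right-hand side of (4) as vector $\mathbf{b}$, then (4) can be expressed as $A\mathbf{w} = \mathbf{b}$, and the final solution for $\mathbf{w}$ can be calculated as

$$\mathbf{w} = A^{-1}\mathbf{b}. \tag{7}$$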
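The robust penalty and the per-pixel evaluation of the data-term weight can be sketched in a few lines. This is only an illustrative sketch: the function names and the dictionary of gradients `g` are hypothetical, and the derivative is taken with respect to the argument $s^2$, which gives $\psi'(s^2) = 1 / (2\sqrt{s^2 + \epsilon^2})$.

```python
import math

EPSILON = 0.001  # small positive constant quoted in the text


def psi(s2: float, eps: float = EPSILON) -> float:
    """Robust penalty psi(s^2) = sqrt(s^2 + eps^2)."""
    return math.sqrt(s2 + eps * eps)


def psi_prime(s2: float, eps: float = EPSILON) -> float:
    """Derivative of psi with respect to its argument s^2."""
    return 0.5 / math.sqrt(s2 + eps * eps)


def psi_prime_data(g: dict, du: float, dv: float, gamma: float) -> float:
    """Data-term weight at one pixel (hypothetical helper).

    g holds the image derivatives of that pixel under the keys
    Ix, Iy, It, Ixx, Ixy, Iyy, Ixt, Iyt; du and dv are the current
    flow increments of the inner iteration.
    """
    grey = g["It"] + g["Ix"] * du + g["Iy"] * dv
    grad_x = g["Ixt"] + g["Ixx"] * du + g["Ixy"] * dv
    grad_y = g["Iyt"] + g["Ixy"] * du + g["Iyy"] * dv
    return psi_prime(grey ** 2 + gamma * (grad_x ** 2 + grad_y ** 2))
```

Because $\psi'$ decreases as its argument grows, pixels that strongly violate the constancy assumptions receive a small weight, which is what makes the estimate robust to outliers.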

Gaussian Mixture Model
The GMM algorithm, which was proposed by Stauffer and Grimson [10], estimates the background image by employing a statistical model of intensity for each pixel in the image frame. Several works have been carried out to improve its performance, and the GMM has been widely adopted as a basic framework for generating background models [11][12][13][14]. Since the GMM algorithm available in the open-source computer vision software library (OpenCV) [34] has been optimized in terms of performance and complexity, it has been adopted in various FPGA implementations [35][36][37] and is also applied in our proposed algorithm.
The GMM algorithm is composed of a mixture of $n$ Gaussian distributions represented by three parameters: weight ($w$), mean ($\mu$) and variance ($\sigma^2$). The Gaussian distributions of each pixel have different parameters and change for each image frame. Therefore, these parameters are defined by three indices, namely $i$, $n$, and $t$, where $i$ is the index for pixel intensity, $n$ denotes the index for the Gaussian distributions, and $t$ is an index that refers to the time of the considered frame.
These parameters are updated differently depending on the match condition, which indicates whether a pixel is suitable for the background model. The match condition is checked against the $n$ Gaussian distributions that model the pixel and is given by

$$\left|i_t - \mu_{n,t-1}\right| \le D \cdot \sigma_{n,t-1}, \tag{8}$$

where $D$ is a threshold whose value was experimentally chosen to be equal to 2.5. The background model of each pixel is generated in a grey scale ranging from 0 to 255 using the mean and the weight of the Gaussian model as follows:

$$B_t = \sum_{n} w_{n,t}\, \mu_{n,t}. \tag{9}$$
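As a rough illustration of the two per-pixel operations described above, the sketch below assumes the common Stauffer-Grimson form of the match test (a pixel matches a Gaussian when it lies within $D$ standard deviations of its mean) and assumes that the background value is the weight-weighted sum of the means. The class and function names are hypothetical, and the parameter update step is left out.

```python
from dataclasses import dataclass
from typing import List

D = 2.5  # match threshold quoted in the text


@dataclass
class Gaussian:
    weight: float
    mean: float
    variance: float


def matches(intensity: float, g: Gaussian, d: float = D) -> bool:
    """Assumed match test: |intensity - mean| <= d * standard deviation."""
    return abs(intensity - g.mean) <= d * g.variance ** 0.5


def background_value(mixture: List[Gaussian]) -> float:
    """Assumed background estimate: weighted sum of means, kept in 0..255."""
    value = sum(g.weight * g.mean for g in mixture)
    return max(0.0, min(255.0, value))


pixel_model = [Gaussian(0.7, 100.0, 16.0), Gaussian(0.3, 200.0, 25.0)]
is_background_like = matches(109.0, pixel_model[0])
grey_level = background_value(pixel_model)
```

With weights that sum to one, the estimate stays inside the 0 to 255 grey-scale range mentioned in the text; the clamp only guards against rounding.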

Proposed Moving Object Detection Algorithm
Since the image frames obtained from moving cameras contain the motion of both the background and the objects, the proposed MOD algorithm performs a background compensation process before detecting objects, as shown in Figure 1. The background compensation process estimates the ego-motion and generates the compensated background. Then, the object detection process extracts the object coordinates using the current image frame and the compensated background. To minimize noise, two detection methods are performed and the final object coordinates are determined by cross-checking the results. After matching the resolution scales for two output coordinates, the final result is confirmed through an intersection operation.

Background Compensation
In order to estimate the ego-motion, the Brox OFE algorithm is applied to two consecutive frames to extract the flow vectors. Since the matrix inversion of $A$ in (7) requires computation time proportional to the image frame size and $A$ is a Hermitian positive-definite matrix, we apply Cholesky factorization to $A$:

$$A = L L^T, \tag{10}$$

where $L$ is a lower triangular matrix. It is much easier to compute the inverse of a triangular matrix [38] and the inverse of the original matrix can be computed by simply multiplying the two inverses as follows:

$$A^{-1} = \left(L^T\right)^{-1} L^{-1}. \tag{11}$$

Then, the final flow vector $\mathbf{w}$ is calculated as

$$\mathbf{w} = \left(L^T\right)^{-1} L^{-1} \mathbf{b}. \tag{12}$$

The extracted flow vector for each pixel represents the motion of the pixel between consecutive frames. The pixels in the background region have a relatively slow motion compared with the object region. In addition, background pixels occupy a larger area in the image frame than objects and exhibit similar motion. Therefore, the most frequent flow vector with slow motion is regarded as the ego-motion, which can be extracted via histogram analysis. Then, the previously derived background model is compensated using the extracted ego-motion. In this approach, the background model is derived via the GMM algorithm and stored as GMM parameters.
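The Cholesky-based solution of $A\mathbf{w} = \mathbf{b}$ can be illustrated on a small symmetric positive-definite system. The sketch below is a plain software illustration with hypothetical function names, not the hardware datapath; instead of forming the two triangular inverses explicitly, it applies them through forward and backward substitution, which yields the same result.

```python
import math
from typing import List

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Return lower triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_cholesky(a: Matrix, b: List[float]) -> List[float]:
    """Solve A w = b via L y = b (forward) and L^T w = y (backward)."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (y[i] - sum(low[k][i] * w[k] for k in range(i + 1, n))) / low[i][i]
    return w


# Toy 2x2 system: 4u + 2v = 2 and 2u + 3v = 5
flow_increment = solve_cholesky([[4.0, 2.0], [2.0, 3.0]], [2.0, 5.0])
```

Each substitution touches only one triangle of the matrix, which is why triangular factors are cheaper to handle than a general inverse.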
After estimating the ego-motion, all GMM parameters can be shifted back along the determined motion, resulting in a compensated background model. First, the ego-motions $e_{dx}$ and $e_{dy}$ in the x-axis and y-axis directions, respectively, are divided into an integer part and a fractional part. The integer parts $e^i_{dx}$ and $e^i_{dy}$ can be obtained by rounding up $e_{dx}$ and $e_{dy}$, respectively, whereas the fractional parts $e^f_{dx}$ and $e^f_{dy}$ are computed as

$$e^f_{dx} = e_{dx} - e^i_{dx}, \tag{13}$$

$$e^f_{dy} = e_{dy} - e^i_{dy}. \tag{14}$$

Then, the GMM parameters $w_{n,t-1}$, $\mu_{n,t-1}$ and $\sigma^2_{n,t-1}$ are shifted by $e^i_{dx}$ and $e^i_{dy}$ as shown in Figure 2. The shifted parameters are then interpolated using $e^f_{dx}$ in the x-axis direction, as presented in (15)-(17), and the same is done for $e^f_{dy}$ in the y-axis direction, as presented in (18)-(20). After the compensation process is complete, the new GMM parameters are updated through the GMM algorithm. Then, the final background model, that is, the compensated background model, is generated via (9) using $w_{n,t}$, $\mu_{n,t}$, and $\sigma^2_{n,t}$.
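The integer shift followed by fractional interpolation can be sketched for a single row of one GMM parameter. Several points here are assumptions rather than the authors' exact equations: linear interpolation between the two neighbouring samples, clamping at the image border, and the sign convention that the compensated value at column `x` is read from position `x + e`. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def split_motion(e: float) -> Tuple[int, float]:
    """Split e into an integer part (rounded up) and the remaining fraction."""
    e_int = math.ceil(e)
    return e_int, e - e_int  # the fraction lies in (-1, 0]


def compensate_row(params: List[float], e: float) -> List[float]:
    """Resample one row of a GMM parameter at positions x + e (assumed convention)."""
    e_int, e_frac = split_motion(e)
    last = len(params) - 1
    out = []
    for x in range(len(params)):
        upper = min(max(x + e_int, 0), last)      # sample after the integer shift
        lower = min(max(x + e_int - 1, 0), last)  # its left neighbour
        # Linear blend; e_frac <= 0, so both blend weights are non-negative.
        out.append((1.0 + e_frac) * params[upper] - e_frac * params[lower])
    return out


means = [0.0, 10.0, 20.0, 30.0, 40.0]
shifted = compensate_row(means, 1.5)
```

The same routine would be applied to the weight, mean and variance of every Gaussian, first along the x-axis and then along the y-axis.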

Object Detection
To extract object coordinates in an image frame using the compensated background model, the proposed MOD algorithm performs a background subtraction operation first, followed by Brox OFE. Background subtraction consists of separating moving objects from stationary background images. If the difference between the compensated background model and the current frame at a pixel is larger than the threshold, the pixel is classified as a moving object; otherwise, it is classified as background. Although this approach can effectively detect objects, such a simple comparison results in false positives.
In order to solve this problem, Brox OFE between the compensated background and the current frame is performed to extract the flow vectors of all pixels. The extracted flow vectors that have different magnitude and direction from those of the background can be grouped into objects. To group the object regions from the overall vectors in a frame, a proper threshold should be determined. This threshold has to be chosen for each frame by considering the distribution of the flow vectors. Since we derived the detection results via background subtraction, the final detection results are determined via cross-checking with both sets of results to reduce the number of false positives. A median filter on the results is applied to remove relatively small objects, such as those caused by noise.
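The detection steps described in this section (thresholded background subtraction, cross-checking against a second mask, and median filtering) can be sketched on small binary masks. This is a simplified illustration with hypothetical function names: the flow-based mask is taken as given, the filter is a 3 × 3 median with windows truncated at the borders, and the filter size is an assumption.

```python
from typing import List

Mask = List[List[int]]


def subtract_background(frame: Mask, background: Mask, threshold: int) -> Mask:
    """Mark a pixel as object (1) when |frame - background| exceeds the threshold."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]


def cross_check(mask_a: Mask, mask_b: Mask) -> Mask:
    """Keep only pixels flagged by both detection methods (intersection)."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


def median3x3(mask: Mask) -> Mask:
    """3x3 median filter on a binary mask; windows are truncated at the borders."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(mask[yy][xx]
                            for yy in range(max(0, y - 1), min(h, y + 2))
                            for xx in range(max(0, x - 1), min(w, x + 2)))
            out[y][x] = window[len(window) // 2]
    return out
```

An isolated pixel is removed by the median filter because most of its neighbours are background, whereas the interior of a larger region survives.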

Hardware Architecture Design
In this section, we present the hardware architecture of the proposed moving object detector for real-time processing. Figure 3 shows a block diagram of the proposed moving object detector, which consists of an optical flow estimator, a camera motion estimator, a background detector and an object detector. The data stream of the image frame, which enters from the external camera module, is stored in the input frame buffers. Then, the pixel intensities of two consecutive frames, $i_{t-1}$ and $i_t$, are selected from these buffers, and the optical flow estimator uses them to compute the flow vectors $u_e$ and $v_e$ needed for ego-motion estimation. The histogram statistics of these flow vectors are analyzed by the camera motion estimator to extract the ego-motions $e_{dx}$ and $e_{dy}$. In order to generate the compensated background $B_t$, the background detector shifts the GMM parameters according to the estimated ego-motions, as explained in Section 3.1, and updates the corresponding parameters by applying the GMM algorithm. Using $B_t$ and the pixel intensities of the current frame $i_t$, the object detector performs background subtraction, and the optical flow estimator simultaneously extracts new flow vectors $u_o$ and $v_o$. These flow vectors are used by the object detector to classify the object region. Finally, the object coordinates are generated by combining the two sets of detection results.

Optical Flow Estimator
The optical flow estimator shown in Figure 4a is composed of a convolution unit (CU) for pre-processing, a resolution process unit (RPU) for computing the solution of the Euler-Lagrange equation, a warping unit, and an output decision unit. Since a multi-scale approach (also called pyramid) is required, the input frames are scaled to a lower resolution after Gaussian smoothing. Then, a gradient filtering module calculates $I_x$, $I_y$, $I_t$, $I_{xx}$, $I_{xy}$, $I_{xt}$, and $I_{yt}$ using the scaled image frames. The Gaussian smoothing, image scaling, and gradient filtering operations are grouped into the CU and have a shared structure in the convolution calculator to reduce hardware complexity. This is possible because they perform similar image filtering operations. Employing this shared structure reduces the number of multipliers by ten, that of adders by five, and that of line buffers by four, as shown in Figure 4b. After gradient filtering is complete, the RPU computes (4) to extract the flow vectors. Then, the warping unit generates higher resolution image frames using the previously scaled data and extracted flow vectors. The overall operation of the optical flow estimator is repeated during the pyramid loop.
Since similar calculations are repeated in (4), (5), and (6), we employed a shared structure to reduce the number of operators and memory requirements. The CFU factorizes $A$ into the lower triangular matrix $L$ and its transpose $L^T$ and performs matrix inversion. These operations require excessive memory access, which depends on image frame size. Excessive memory access results in high power consumption and makes real-time processing impossible. Therefore, we apply a shift register bank, which can reduce the number of memory access operations by 95.75%.

Camera Motion Estimator
Figure 6 depicts the camera motion estimator, which is composed of a location finder, a 7 × 128 decoder, a counter bank, and some calculators. The designed camera motion estimator analyzes the histogram of the flow vectors.
The histogram is generated by dividing the entire range of the flow vectors into a series of intervals and then counting how many vectors fall into each interval. We divide the entire range into 128 intervals, considering the trade-off between hardware complexity and performance. Histogram analysis is performed using the counter circuits and the control signal of each counter is generated by the location finder and the 7 × 128 decoder. Finally, $e_{dx}$ and $e_{dy}$ are extracted by finding the maximum count value.
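In software terms, the histogram analysis amounts to binning one flow component into 128 intervals and taking the most populated bin. The sketch below is a behavioural model only: the flow range of −8 to 8 pixels, the use of the bin centre as the returned ego-motion, and the function name are all assumptions, and the routine would be run once for the horizontal component and once for the vertical one.

```python
from typing import List

NUM_BINS = 128  # number of intervals stated in the text


def estimate_ego_motion(flow: List[float], lo: float, hi: float,
                        num_bins: int = NUM_BINS) -> float:
    """Return the centre of the most populated histogram bin of one flow component."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins  # plays the role of the counter bank
    for v in flow:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)  # clamp out-of-range vectors
        counts[idx] += 1
    best = max(range(num_bins), key=counts.__getitem__)
    return lo + (best + 0.5) * width


# Mostly background motion near 1 pixel, plus a few faster object vectors.
horizontal_flow = [1.0] * 10 + [1.05] * 5 + [5.0] * 3
e_dx = estimate_ego_motion(horizontal_flow, -8.0, 8.0)
```

Because the background covers most of the frame, its motion dominates the counts and the object vectors do not disturb the estimate.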

Background Detector
The background detector, which is shown in Figure 7, is composed of a camera motion compensator that performs compensation for the GMM parameters using $e_{dx}$ and $e_{dy}$ and a GMM-based background estimator that updates the GMM parameters and estimates the background $B_t$. The camera motion compensator performs a shift operation with integer parts $e^i_{dx}$ and $e^i_{dy}$ and then interpolates the GMM parameters using fractional parts $e^f_{dx}$ and $e^f_{dy}$.

Object Detector
Figure 8 shows the object detector, which consists of a background subtractor, a threshold decision unit, an object memory, a median filter and an object decision unit. First, the absolute value of the difference between $i_t$ and $B_t$ is provided to the comparator and object candidates are generated by comparing this value with an experimentally determined threshold value. These background subtraction results are stored in the object memory. Afterwards, the comparator generates object candidates using the flow vectors and the threshold, which is determined according to the distribution of these vectors. The generated object candidates are also stored in the object memory, and median filtering is performed. Finally, the coordinates of the objects are detected by cross-checking the two sets of results in the object decision unit.

FPGA Implementation
The proposed moving object detector was designed using a hardware description language (HDL) and implemented on a Xilinx Virtex-5 FPGA device. As a result, the proposed moving object detector was implemented with 13.2K logic slices, 104 DSP48s, and 163 BRAM, as shown in Table 1. The comparison results between the proposed GMM-based background generator and previous GMM implementations [35,36] are presented in Table 2. The GMM-based background generator employed in the proposed design has a similar complexity to the method presented in Reference [36] and can be implemented using fewer resources than that presented in Reference [35].
Since the final object coordinates are generated at intervals of 6.67M clock cycles for an image resolution of 640 × 480, we confirmed that real-time processing at 30 fps is possible using an FPGA test system at 200 MHz. The total number of clock cycles is proportional to the resolution of the input image. Table 3 shows comparison results in terms of processing speed between this work and other MOD schemes that can perform real-time operation in moving camera environments. The results confirm that the proposed system is significantly faster in terms of processing speed (fps) than other schemes that can support real-time processing. In order to evaluate the performance of the proposed moving object detector in an actual vehicle environment, an FPGA test platform was constructed and is shown in Figure 9. This verification platform included an FPGA device with the proposed moving object detector, a 640 × 480-resolution camera and an HDMI recorder.
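The relation between clock frequency, frame rate and cycle budget quoted above can be checked with simple arithmetic. The helper names below are hypothetical, and the extrapolation to another resolution relies only on the statement that the cycle count is proportional to the number of pixels; it is not a measured result.

```python
CLOCK_HZ = 200_000_000  # operating frequency
FPS = 30                # target frame rate
WIDTH, HEIGHT = 640, 480


def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles available for each frame at the given frame rate."""
    return clock_hz / fps


def max_fps(clock_hz: float, cycles_per_pixel: float, width: int, height: int) -> float:
    """Frame rate reachable if the cycle count scales with the pixel count."""
    return clock_hz / (cycles_per_pixel * width * height)


budget = cycles_per_frame(CLOCK_HZ, FPS)    # about 6.67M cycles per frame
per_pixel = budget / (WIDTH * HEIGHT)       # about 21.7 cycles per pixel
fps_720p = max_fps(CLOCK_HZ, per_pixel, 1280, 720)
```

Under this proportionality assumption, a 1280 × 720 input, which has three times as many pixels, would run at roughly 10 fps at the same clock frequency.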

Performance Evaluation
MOD performance metrics, namely precision ($P_r$), recall ($R_e$), and F-measure ($F_m$), were used to carry out a numerical comparison between existing algorithms and the proposed algorithm. These metrics are defined as follows:

$$P_r = \frac{TP}{TP + FP}, \tag{21}$$

$$R_e = \frac{TP}{TP + FN}, \tag{22}$$

$$F_m = \frac{2 \cdot P_r \cdot R_e}{P_r + R_e}. \tag{23}$$

True positives (TP) represents the total number of actual object pixels that are recognized as an object and false negatives (FN) denotes the total number of actual object pixels that are erroneously recognized as background. False positives (FP) represents the total number of background pixels that are recognized as an object. Therefore, $P_r$ quantifies the fraction of actual object pixels among all the pixels recognized by the algorithm as objects, and $R_e$ quantifies the detection rate as the fraction of actual object pixels that the algorithm recognizes as objects. Table 4 shows the results obtained by applying existing MOD algorithms and the proposed moving object detector to 200 consecutive image samples with three vehicles moving to the right [39]. Two rank-constrained models [19,20] exhibited excellent recall performance, but their precision was lower, which would give the driver more false alarms. Although the algorithm presented in Reference [29] exhibited a higher precision than those of References [19,20], its recall performance was lower. In contrast, the proposed moving object detector exhibited the same precision as the algorithm from Reference [29], minimized the number of false alarms, and had a recall performance of 95%, which is 17 percentage points higher than that of the algorithm from Reference [29].

Table 4. MOD performance comparison between the proposed moving object detector and other algorithms.

Algorithm                  Precision   Recall   F-Measure
Rank-constrained 1 [19]    0.95        0.92     0.9348
Rank-constrained 2 [20]    0.83        0.99     0.9030
Kim et al. [29]            0.98        0.78     0.8686
Proposed                   0.98        0.95     0.9648

Figure 10 shows examples of the experimental results obtained after applying the proposed moving object detector to the image samples taken from a vehicle equipped with the FPGA platform shown in Figure 9. As can be seen from Figure 10, the proposed algorithm exhibited good object detection performance in a vehicle environment, and we confirmed that false positives rarely occurred.
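The precision, recall and F-measure definitions used in this section can be written as a few small functions and checked against the values reported in Table 4. The function names are hypothetical, and the F-measure is taken to be the harmonic mean of precision and recall, which reproduces the tabulated figures.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of pixels flagged as object that are actual object pixels."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual object pixels that are flagged as object."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


# (precision, recall) pairs taken from Table 4
table4 = {
    "Rank-constrained 1 [19]": (0.95, 0.92),
    "Rank-constrained 2 [20]": (0.83, 0.99),
    "Kim et al. [29]": (0.98, 0.78),
    "Proposed": (0.98, 0.95),
}
f_scores = {name: round(f_measure(p, r), 4) for name, (p, r) in table4.items()}
```

The harmonic mean penalizes an imbalance between the two metrics, which is why the method of Reference [29] scores lowest despite its high precision.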

Conclusions
In this paper, we proposed a novel MOD algorithm, which can operate in moving camera environments. In addition, an area-efficient hardware design for the proposed algorithm was presented for real-time processing. Experimental results demonstrate the overall improvements achieved using the proposed algorithm in terms of precision, recall and F-measure, which are important features for ADAS applications. The proposed moving object detector was implemented with 13.2K logic slices, 104 DSP48s, and 163 BRAM and an FPGA test platform was constructed for verification in a vehicle environment. Through this verification, we confirmed that the proposed moving object detector achieved higher accuracy than existing MOD algorithms and that it can support real-time processing at 30 fps at an operating frequency of 200 MHz.

Conflicts of Interest:
The authors declare no conflict of interest.