Multi-Model Estimation Based Moving Object Detection for Aerial Video

With the rapid development of UAV (Unmanned Aerial Vehicle) technology, moving target detection in aerial video has become a popular research topic in computer vision. Most existing methods follow the registration-detection framework and can only deal with simple background scenes. They tend to fail in complex scenes that contain multiple backgrounds, such as viaducts, buildings and trees. In this paper, we break through the single-background constraint and perceive the complex scene accurately by automatically estimating multiple background models. First, we segment the scene into several color blocks and estimate the dense optical flow. Then, we calculate an affine transformation model for each large-area block and merge the consistent models. Finally, we calculate, pixel by pixel, the subordinate degree of all small-area blocks to the multiple background models. Moving objects are segmented by an energy optimization method solved via Graph Cuts. Extensive experimental results on public aerial videos show that, by estimating multiple background models and analyzing each pixel's subordinate relationship to them through energy minimization, our method can effectively remove buildings, trees and other false alarms and detect moving objects correctly.


Introduction
Moving target detection for aerial video is one of the core technologies of UAV (Unmanned Aerial Vehicle) surveillance systems. This technology can be widely applied in military domains such as battlefield reconnaissance and surveillance, positioning and adjustment, damage assessment and electronic warfare. It can also support civil purposes such as border patrol, nuclear radiation detection, aerial photography, aerial prospecting, disaster monitoring, traffic patrol and security surveillance. Because UAVs offer wide applicability, low cost, high cost effectiveness, no risk of casualties, strong survivability, good maneuverability and convenience, moving object detection for UAV aerial video has become a hot research topic in the computer field. Moving object detection from a UAV is an important research topic at the intersection of image processing and vehicle control. The purpose of this research is to automatically obtain the target position and motion information from aerial video. It not only sharpens the UAV's perception, but also underpins higher-level processing and applications such as behavior analysis and importance analysis.
Moving object detection for aerial video faces several core difficulties, such as abrupt motion changes caused by fast UAV motion, low-resolution noisy images, small targets, low contrast, complex backgrounds, scale changes and occlusion. As UAVs have developed, researchers have proposed many algorithms to address these problems. However, most of these methods follow the registration-detection framework, which assumes that the scene has only a single background and identifies every region that generates parallax error as a target. As a result, failures usually occur in complex scenes with multiple backgrounds, trees, buildings, etc. Therefore, state-of-the-art solutions in moving object detection cannot satisfy application needs, and it is necessary to develop new techniques for complex scenes.
Automatic estimation of multiple background models for complex scenarios can provide a solution for perceiving the scene accurately. This paper first focuses on automatic estimation of multiple background models for complex scenarios. Then the pixels' motion information and subordinate degrees to multi-background models are analyzed by optical flow. The subordinate degree between a pixel and a background model refers to the degree a pixel and its correspondence fit the background model. Usually, the projection error can be used to measure the subordinate degree. The larger the projection error, the lower the subordinate degree is. Based on the neighborhood information and the subordinate degree, we segment the moving objects via energy minimization [1,2]. Since we estimate multiple background models and perceive complex scenes correctly, our method can detect moving objects accurately under viaducts and other complex backgrounds. Meanwhile, our algorithm can effectively remove buildings, trees and other false alarms and improve the locating precision. In addition, the adoption of energy minimization, which makes use of both the analysis of neighborhood continuity and subordinate degree, can significantly improve segmentation precision.
The rest of this paper is organized as follows. Section 2 summarizes and analyzes the related work in recent years. Section 3 proposes the moving object detection algorithm based on multi-model estimation for aerial video of complex scenarios. The experimental results are reported in Section 4, which demonstrate the accuracy and effectiveness of our approach. Finally, the conclusions are drawn in Section 5.

Related Work
Moving object detection for aerial video [3] has developed considerably in the past few decades. Existing algorithms fall mainly into two categories [4,5]: bottom-up methods and top-down methods. The bottom-up method, also called the data-driven method, does not rely on prior knowledge and extracts motion information directly from the image sequences. The top-down method, also called the model-driven method, relies on a constructed model or prior knowledge and either performs matching or solves for the posterior probability in the image sequences. In matching, a moving object is detected if the similarity distance is small enough. When computing the posterior probability, the state vector corresponding to the maximum posterior probability is taken as the current state of the moving object.
Using bottom-up method to realize moving object detection for aerial video mainly includes three steps [6][7][8][9][10]. The first step is image matching [11][12][13], which performs the adjacent frames registration for image sequences. The second step is object detection. Frame difference or background difference is often used to detect change blobs and obtain moving objects after registration. The third step is object classification. There are two tasks in this step. One is to extract the detected moving objects. The other one is to recognize these objects.
The existing bottom-up algorithms for moving object detection include the classic COCOA system [14]. Its procedure consists of image stabilization, frame difference and block tracking. However, this algorithm often fails when the scene scale changes, because its image stabilization is based on Harris corners. Cohen et al. [15,16] proposed a moving object detection and tracking system. They first aligned the images by iteratively estimating the affine transformation model. Then, the normalized optical flow field was applied for motion detection, and a graph representation was constructed to resolve and maintain the dynamic template of moving objects. This system runs fast but cannot handle scenes with complex scale changes. Ibrahim et al. [17] proposed the MODAT framework. Instead of Harris corners, they adopted SIFT (Scale-Invariant Feature Transform) [18] features for image matching. However, all of the above methods can only deal with simple background scenes and assume that only moving objects cause parallax error. They tend to fail in complex scenes with multiple backgrounds, such as viaducts, buildings and trees. Chad et al. [19] proposed a moving object detection method for aerial video with a low frame rate. They constructed an accurate background model to solve the object detection and shadow problems. However, the application of this method is restricted because the camera calibration parameters must be known in advance and object tracking must be started manually. Shen et al. [20] proposed a moving object detection method for aerial video based on spatiotemporal saliency. However, this method still cannot overcome the parallax error problem, and its false alarm rate is high in complex scenes. As shown in Figure 1b, false alarms (labeled by the red circles) occur at buildings and trees when a single affine model is used to describe the scene. Real objects may also be missed due to the inaccurate model estimation.
The top-down method transforms the moving object detection problem into Bayesian prediction. Given the prior probability of the object state, the problem is solved by continuously estimating the maximum a posteriori probability as new measurements arrive. In other words, Bayesian theory treats the vision-tracking problem as a "best guess" or "deduction" process, and usually adopts the state space approach to achieve vision tracking. The classical Kalman filter [21] can only handle linear, Gaussian and unimodal situations. However, posterior estimation is often non-linear, non-Gaussian and multimodal in practice. Therefore, the EKF (Extended Kalman Filter) [22] was proposed to handle the non-linear case. A particle filter [23] can also solve such non-linear problems. The top-down method uses prior knowledge to construct a model for the detection problem. Then, the model's correctness is verified against the actual image sequences. Because they have a solid mathematical foundation and many mathematical tools can be adopted, top-down approaches have long been the mainstream methods for vision detection. These approaches transform object detection problems into deduction and prediction problems. The assumption is that when the prior knowledge is correct, the deduction results will be correct; otherwise, the results may be wrong. Thus, acquiring correct prior knowledge is very important. Existing approaches mostly initialize the objects manually to ensure the correctness of subsequent detection and localization, which is unrealistic in practical applications. Therefore, in order to detect moving objects automatically in aerial video, reliable detection results from the bottom-up approach should be used as the prior knowledge to achieve a correct prediction.
In this paper, we propose a moving object detection algorithm based on multi-model estimation for aerial video. First, we segment the scene into several color blocks and estimate the dense optical flow. Then, we calculate an affine transformation model for each large area block and merge the consistent models. Finally, Graph Cuts [1,2] is utilized to classify the foreground pixels into different objects. Our method can not only handle the moving object detection in the complex multiple background scenarios with viaducts, but can also remove buildings, trees and other false alarms effectively. As a result, the segmentation and detection precision will be improved.

Multi-Model Estimation Based Moving Object Detection
In order to overcome the influence of complex scenes with multiple backgrounds, this paper proposes a moving object detection algorithm for aerial video based on multi-model estimation.
First, the scene is segmented into several color blocks. Second, the affine transformation model between each background region in the current frame and the corresponding region in the adjacent frame is estimated based on the dense optical flow. Third, the subordinate degree between each pixel and the multiple background models is calculated to judge whether the pixel belongs to a moving object. Finally, moving objects are segmented by an energy optimization method solved via Graph Cuts.

Algorithm Flow
The flowchart of the proposed framework is shown in Figure 2. Our approach mainly includes four steps: overall perception of the scene, background model extraction, background region segmentation and moving object detection. First, the overall perception step segments the scene into several color blocks and estimates the dense optical flow. Here, the mean shift pyramid segmentation method from OpenCV (Open Source Computer Vision Library) is adopted for color block segmentation, and the Gunnar Farneback algorithm [24] is used to calculate the dense optical flow. Second, to determine the multiple background models contained in the scene, background model extraction calculates affine transformation models for the color blocks and merges the consistent models. Third, background region segmentation is cast as two labeling problems: a binary classification of background versus foreground, and a multi-label classification of background pixels among the multiple background regions. Both can be solved by the energy optimization method, which can achieve a smooth and continuous globally optimal solution. Fourth, after obtaining the foreground regions, we merge the blocks and remove false objects based on motion consistency and region proximity. This completes moving object detection and yields the final detection results. Background model extraction, background region segmentation and moving object detection are introduced in Sections 3.2, 3.3 and 3.4, respectively.

Multi-Model Estimation
Accurately estimating the background model parameters of complex scenes ensures correct scene perception, accurate object segmentation and robust object tracking. Existing multi-model estimation methods, such as J-Linkage [25], do not need any prior segmentation information and can classify samples into multiple categories automatically, where each category corresponds to one model. However, such methods are only suited to small sample sets and cannot handle the large sample sets that arise in multi-model estimation for complex scenes. In aerial video, background blocks with consistent color often belong to the same background, and the background area is much larger than that of the objects. Therefore, this paper first segments the scene into color blocks and selects the blocks with large area as candidate background blocks. Afterwards, an affine transformation model is estimated for each background block.
Let $I_t$ and $I_{t+1}$ denote two adjacent frames. The dense optical flow between them is computed by the Gunnar Farneback algorithm [24]. We define $OFX_t$ and $OFY_t$ as the horizontal and vertical optical flow, respectively. The corresponding relationships are as follows:

$$x' = x + OFX_t(x, y), \qquad y' = y + OFY_t(x, y)$$

where $(x, y)$ and $(x', y')$ form an optical flow pair. Next, we segment the frame using the mean shift algorithm, which divides the scene into multiple color blocks based on their color consistency. Then, the blocks whose area is larger than the threshold $T_a^{\min}$ are selected as background blocks $\{b_1, b_2, \ldots, b_{BNum}\}$, where $BNum$ represents the number of color background blocks obtained by segmentation. The color blocks' area set is defined as $\{a_{b_1}, a_{b_2}, \ldots, a_{b_{BNum}}\}$, where $a_{b_i}$ is the number of pixels in the $i$th background color block.
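As an illustration of the two steps above, the following minimal Python sketch pairs each pixel with its optical-flow point and selects the large color blocks. It assumes that the flow fields are stored as per-pixel displacements added to the pixel coordinates, and that the segmentation is available as a label map. The function names, the toy values and the `min_area` argument (standing in for the area threshold) are illustrative and are not taken from the paper.

```python
from collections import Counter

def flow_pairs(ofx, ofy):
    """Pair each pixel (x, y) with its optical-flow point (x', y') in the next frame."""
    pairs = {}
    for y, (row_x, row_y) in enumerate(zip(ofx, ofy)):
        for x, (dx, dy) in enumerate(zip(row_x, row_y)):
            # Assumption: the flow is a displacement added to the pixel position.
            pairs[(x, y)] = (x + dx, y + dy)
    return pairs

def large_blocks(labels, min_area):
    """Return {block id: pixel count} for the blocks whose area exceeds min_area."""
    areas = Counter(label for row in labels for label in row)
    return {block: area for block, area in areas.items() if area > min_area}

# Toy 2x3 frame: horizontal/vertical flow and a color-block label map (made-up values).
ofx = [[1.0, 1.0, 0.5], [1.0, 1.0, 0.5]]
ofy = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
labels = [[0, 0, 1], [0, 0, 1]]

pairs = flow_pairs(ofx, ofy)
background = large_blocks(labels, min_area=3)
```

In this toy frame only block 0 (four pixels) passes the area test, so it would be the single candidate background block.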
Afterwards, each point in a color background block and its corresponding point in the next frame, obtained by the optical flow method [24], form a point pair. Based on the point pairs in each background block, the affine transformation model between the background block in the current frame and the corresponding region in the next frame is estimated via the RANSAC (RANdom SAmple Consensus) method [12].
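The estimation step above can be sketched as follows. This is a minimal pure-Python illustration of RANSAC over point pairs, not the authors' implementation. The minimal three-pair sample, the Euclidean inlier distance, the iteration count, the fixed seed and all function names are assumptions made for the example.

```python
import math
import random

def fit_affine(src, dst):
    """Exact affine map taking three source points to three destination points."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-9:
        return None  # collinear sample: no unique affine map

    def solve(v1, v2, v3):
        # Cramer's rule for p*x + q*y + r = v at the three source points
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    return solve(*(d[0] for d in dst)), solve(*(d[1] for d in dst))

def apply_affine(model, point):
    """Map a point with the model ((p, q, r) for x', (p, q, r) for y')."""
    (a1, a2, a0), (b1, b2, b0) = model
    x, y = point
    return a1 * x + a2 * y + a0, b1 * x + b2 * y + b0

def ransac_affine(pairs, threshold, iterations=200, seed=0):
    """Keep the sampled model that explains the most point pairs."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(pairs, 3)
        model = fit_affine([s for s, _ in sample], [d for _, d in sample])
        if model is None:
            continue
        inliers = [(s, d) for s, d in pairs
                   if math.dist(apply_affine(model, s), d) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy block: 25 background pixels shifted by (2, -1) plus two pixels moving differently.
block = [((x, y), (x + 2.0, y - 1.0)) for x in range(5) for y in range(5)]
moving = [((10, 10), (30.0, 30.0)), ((11, 10), (31.0, 30.0))]
model, inliers = ransac_affine(block + moving, threshold=3.0)
```

A production system would normally refine the winning model by least squares over all of its inliers; that step is omitted here to keep the sketch short.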
The affine transformation model set $M_t$ is composed of the affine transformation models of all background blocks. The affine transformation model of the $i$th background block is denoted as $m_i$ and includes translation, rotation, scaling, shearing and other atomic transformations. $a_0$ and $b_0$ represent the shift between the background block in the current frame and the region in the next frame along the horizontal and vertical directions, respectively. The remaining parameters represent the composite of scaling, rotation and shearing. The current segmentation of background blocks is based on color consistency, so a single background may be split into several background blocks due to color inconsistency. For the convenience of later scene analysis, we need to merge the background models according to their mutual consistency. Thus, we define the projection error of a pair of points under model $m_i$ as follows:

$$Error_i(x, y) = \left\| m_i(x, y) - (x', y') \right\|$$

where $(x', y')$ denotes the optical flow point of pixel $(x, y)$ in the consecutive frame. The projection error is the difference, in pixels, between two points located in consecutive images that are related by the optical flow. If $Error_i(x, y) > T_e$, the pair of points is an outlier for the $i$th background block. Then we calculate the connective matrix $R_{BNum \times BNum}$ between the background blocks and the affine models as follows:

$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1\,BNum} \\ \vdots & \vdots & \ddots & \vdots \\ r_{BNum\,1} & r_{BNum\,2} & \cdots & r_{BNum\,BNum} \end{bmatrix}$$

where $r_{ij}$ represents the accordance degree of the $j$th background block with the $i$th affine model.
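To make the consistency check concrete, the sketch below computes the projection error of a point pair under an affine model and fills a small connective matrix. The accordance measure used here, the fraction of a block's point pairs that are inliers of a model, is an assumption chosen for illustration, as are the parameter layout of the model tuples, the Euclidean distance and the toy blocks. The threshold value 3 is the projection error threshold reported in the experiments.

```python
import math

def projection_error(model, point, flow_point):
    """Distance in pixels between the model's prediction and the optical-flow point."""
    (a1, a2, a0), (b1, b2, b0) = model
    x, y = point
    predicted = (a1 * x + a2 * y + a0, b1 * x + b2 * y + b0)
    return math.dist(predicted, flow_point)

def accordance(model, pairs, t_e):
    """Assumed accordance degree: share of a block's pairs that are inliers of the model."""
    inliers = sum(projection_error(model, p, q) < t_e for p, q in pairs)
    return inliers / len(pairs)

# Three toy background blocks; blocks 0 and 1 move almost identically.
models = [
    ((1.0, 0.0, 2.0), (0.0, 1.0, -1.0)),
    ((1.0, 0.0, 2.5), (0.0, 1.0, -1.0)),
    ((1.0, 0.0, 9.0), (0.0, 1.0, 4.0)),
]
blocks = [
    [((x, 0), (x + 2.0, -1.0)) for x in range(4)],
    [((x, 5), (x + 2.5, 4.0)) for x in range(4)],
    [((x, 9), (x + 9.0, 13.0)) for x in range(4)],
]

# R[i][j]: accordance of block j with model i.
R = [[accordance(m, b, t_e=3.0) for b in blocks] for m in models]
```

Under this assumed measure, models 0 and 1 explain each other's blocks and would be candidates for merging, while model 2 describes a separate background.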

Background Segmentation Based on Graph Cuts
We define the set of points that do not belong to the large background regions as $\Omega$. The points of $\Omega$ can then be judged as background points or not, based on the existing multiple background models. This paper proposes an energy minimization based algorithm for this classification. First, we define the scene points as belonging to $l = BNum + 1$ categories, where $BNum$ is the number of background models. We need to define and solve for a label function $f$ that minimizes the energy

$$E(f) = E_d(f) + E_s(f)$$

where the data term $E_d$ represents the sum of the classification costs of the points in $\Omega$ assigned to the different labels. The smooth term $E_s$ is a regularizer that encourages neighboring pixels to share the same label. Therefore, the classification problem is transformed into minimizing $E(f)$ and finding the corresponding solution. However, minimizing $E(f)$ directly is very difficult, because the above problem couples foreground-background classification with background-background classification. This paper decomposes it into two optimization modules, $f = \{fs, fc\}$: (1) optimizing $fs$ for background segmentation; (2) optimizing $fc$ for classifying the different background categories. In the first module, to segment the background regions, we cast the classification as a binary energy minimization. If a pixel belongs to the background, its label is 0; otherwise it is 1. The energy function includes a unary data term and pairwise smoothing terms, where the data term represents the cost of labeling the pixels as background. The smoothing term corresponds to the smoothness prior of the background region. Graph Cuts [1,2] is adopted to solve this energy minimization problem. In the second module, the problem of classifying background points into different background models is cast as a multi-label energy minimization problem, which can also be solved via Graph Cuts [1,2].
The data term of energy function represents the cost of tagging the points with the background labels. The smoothing term represents the background regions' continuity constraint.

Optimal Segmentation of Background Region
Following the above analysis, we need to seek a labeling function $fs: \Omega \to Ls$. The background energy function is defined as follows:

$$E(fs) = E_d(fs) + E_s(fs)$$

• Data term
If a point belongs to the background region, it should be an inlier of one background model, and its projection error with respect to that model should be small; otherwise the point belongs to the foreground region and is an outlier to all the background models. Therefore, we use the projection error to define the data term $E_d(fs)$ through the inlier indicator:

$$IsI_i(p) = \begin{cases} 1, & Error_i(p) < T_e \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

where $IsI_i(p)$ represents pixel $p$'s inlier property under the model $m_i$. If the property is 1, the pixel is an inlier; otherwise it is an outlier.
$Inl(p)$ represents pixel $p$'s background property. If the property is 1, the pixel belongs to the background region; otherwise it belongs to the foreground region. A penalty is given when pixel $p$ is classified as a foreground point while $Inl(p) = 1$: the classification cost is nonzero because $fs(p) - (1 - Inl(p)) = 1$. Similarly, a classification penalty is given when pixel $p$ is classified as a background point while $Inl(p) = 0$.
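The data cost described above can be illustrated with a short Python sketch. It assumes that a pixel's background property is 1 when it is an inlier of at least one background model, and that the per-pixel cost is the absolute value of $fs(p) - (1 - Inl(p))$, so that both kinds of misclassification are penalized by 1. The function names and the toy projection errors are hypothetical, and the smooth term and the Graph Cuts solver are deliberately left out.

```python
def inlier_flags(errors, t_e):
    """IsI_i(p) for every model i: 1 when the projection error is below t_e."""
    return [1 if e < t_e else 0 for e in errors]

def background_property(errors, t_e):
    """Inl(p): 1 when the pixel is an inlier of at least one background model."""
    return max(inlier_flags(errors, t_e))

def data_cost(label, errors, t_e):
    """Cost of labeling a pixel 0 (background) or 1 (foreground); assumed absolute form."""
    return abs(label - (1 - background_property(errors, t_e)))

def data_term(labels, errors_per_pixel, t_e=3.0):
    """Sum of the per-pixel data costs for one candidate labeling."""
    return sum(data_cost(l, e, t_e) for l, e in zip(labels, errors_per_pixel))

# Projection errors of three pixels under two background models (made-up values).
errors_per_pixel = [[0.4, 7.1], [5.0, 6.2], [8.0, 1.2]]

consistent = data_term([0, 1, 0], errors_per_pixel)      # matches the inlier evidence
all_foreground = data_term([1, 1, 1], errors_per_pixel)  # two background pixels mislabeled
all_background = data_term([0, 0, 0], errors_per_pixel)  # one foreground pixel mislabeled
```

The labeling that agrees with the inlier evidence has zero data cost; in the full method the smooth term then trades this cost against label continuity between neighbors.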

• Smooth term
The smooth term $E_s(fs)$ is a regularizer that encourages an overall smooth labeling [1,2]. The prior is that two neighboring pixels are likely to be classified together, either both as background points or both as foreground points. Here, we adopt the standard four-connected neighborhood system and apply a penalty when two neighboring pixels have different labels. As defined in Equation (15), the closer the minimum projection errors of the two neighboring pixels, the larger the smooth cost of labeling them differently. Based on the above data term and smooth term, Graph Cuts is adopted to solve the minimization of $E(fs)$, which yields the background segmentation result.

Optimal Classification of Different Backgrounds
Similarly, we adopt the energy minimization framework for solving $fc$. The energy minimization of the background classification is defined as follows:
• Data term
The data term should reflect the subordinate degree between a background pixel and the multiple background models, and should reach its minimum value when the pixel belongs to one of the models. The projection error satisfies these requirements. Therefore, we define the cost function by using the projection error as follows:

Moving Object Detection
The pixels classified as foreground may come from true moving objects, or from false alarms due to the parallax error caused by buildings and other structures. Distinguishing these two categories of points is the key to segmenting moving objects accurately. When a moving object is compensated by the background model, the residual error is caused only by the object itself and represents the absolute motion vector of the object. The object motion between two neighboring frames is approximately linear. As a result, the motion vectors of the inliers belonging to one object are similar. As shown in Figure 3, the motion vectors of the true object in the red bounding box are similar. In contrast, buildings do not belong to any background, and none of the existing background models can compensate for the parallax error caused by the platform motion. Therefore, whichever background model is used for compensation, the residual motion vectors are dissimilar, as for the false alarm in the blue box of Figure 3. According to the above analysis, we first calculate the motion vectors of the foreground blocks compensated by the background model, and determine the moving objects by analyzing the similarity of the motion vectors. This yields the final foreground color blocks. However, these blocks are segmented according to color consistency. Since the color of an object may be inconsistent, an object is sometimes split into several blocks. To overcome this drawback, we need to merge these foreground blocks. We calculate the adjacency matrix of the foreground blocks. When two blocks $i$ and $j$ are merged, their motion vectors are combined as the area-weighted mean $(a_i v_i + a_j v_j)/(a_i + a_j)$ and the corresponding variance is recalculated. $Nb$, $A$ and $V$ also need to be recalculated.
Afterwards, merging is repeated until no foreground blocks can be merged. The final merged results are the moving object detection results.
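The similarity analysis of motion vectors described above can be sketched as a simple variance test. The example assumes that similarity is measured by the sum of the per-component variances of a block's compensated motion vectors and that a block is kept as a moving object when this value is below the variance threshold (4 in the experiments). The exact variance definition, the function names and the toy vectors are assumptions made for illustration.

```python
from statistics import pvariance

def motion_variance(vectors):
    """Sum of the per-component variances of a block's residual motion vectors."""
    return pvariance([vx for vx, _ in vectors]) + pvariance([vy for _, vy in vectors])

def is_moving_object(vectors, t_sigma=4.0):
    """Assumed rule: similar motion vectors (low variance) indicate a true moving object."""
    return motion_variance(vectors) < t_sigma

# Compensated motion vectors of two foreground blocks (made-up values).
car = [(3.0, 0.1), (3.2, 0.0), (2.9, -0.1), (3.1, 0.0)]           # consistent motion
building = [(4.0, -3.0), (-2.5, 1.0), (0.5, 5.0), (-4.0, -2.0)]   # scattered parallax

car_is_object = is_moving_object(car)
building_is_object = is_moving_object(building)
```

Blocks that pass this test would then enter the merging loop, while scattered blocks such as the toy building are discarded as false alarms.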

Experimental Results and Analysis
In order to evaluate the proposed multi-model estimation based moving object detection algorithm for aerial video, we perform comparison experiments on the public DARPA VIVID (Defense Advanced Research Projects Agency, Video Verification of Identity program) and KIT AIS (Karlsruher Institut für Technologie Aerial Image Sequences) data sets. In the DARPA VIVID database [26], the EgTest01 dataset contains many moving cars, but the background is relatively simple. In the KIT AIS Data Set [27], the frame rate is 1 FPS and the scenes include viaducts, overpasses, buildings, trees and other complex elements, which is very challenging for moving object detection algorithms for aerial video.
$T_a^{\max}$ is the max threshold for the object area. If it is set too small, true objects will be considered as small background blocks. If it is set too large, objects close to each other will be considered as one. In our experiments, we set $T_a^{\max} = 800$ to detect vehicles on the road. $T_e$ is the projection error threshold. If a pixel's projection error for a given affine model is larger than $T_e$, it is considered an outlier for the model; if it is smaller than $T_e$, it is an inlier. A smaller $T_e$ produces more outliers and therefore more false alarms. A larger $T_e$ sometimes makes the algorithm miss true moving pixels. To balance false alarms and misses, we set $T_e = 3$ in our experiments. The variance threshold $T_\sigma$ determines which foreground blocks are true object blocks and which are false alarms. The smaller the value of $T_\sigma$, the fewer false alarms we detect, but the more likely we are to miss true moving objects. A larger $T_\sigma$ causes more false alarms. We set $T_\sigma = 4$ in our experiments for the best performance. The smooth threshold $\tau_s$ defines the max smooth cost of labeling two neighboring pixels with different labels. A larger $\tau_s$ produces a smoother labeling map, and object misses are more likely to occur. A smaller $\tau_s$ decreases the smoothing effect. We set $\tau_s = 4$ in our experiments.
The detection method in [14] is the most representative method, in which Harris features are extracted for registration and frame difference is used to detect moving objects. Shen et al. [20] proposed a moving object detection method for aerial video based on spatiotemporal saliency. This method can accurately handle moving target detection in simple scenes. However, it does not analyze multiple backgrounds, so missed detections and false alarms occur frequently in complex scenes. As no code for the approaches in [14,20] has been published on the web, we implemented these two approaches ourselves for comparison.
We compare our algorithm with the method in [14] and the method of Shen [20] on the StuttgartCrossroad01 dataset of the KIT AIS Data Set. The results are shown in Figure 4. This dataset contains overpasses and multiple backgrounds as well as complex elements, such as trees and shadows, which influence the detection results. All these factors pose a substantial challenge to detection algorithms. In Figure 4, the images from top to bottom show the results for the 1st, 5th, 9th and 12th frames. The images from left to right are, respectively, the detection results of this paper, the segmentation results of this paper, the detection results of [14], the detection results of Shen [20], and the ground truth. In the first column of Figure 4, the objects in blue boxes are the detection results of this paper. The objects in red boxes are stationary targets. The detection results show that our approach can segment and detect moving objects accurately against the complex background of overpasses. Since neither of the approaches in [14,20] can perceive the multiple backgrounds of the scene or obtain accurate background information, inaccurate motion segmentation and false alarms occur. These situations can be seen in the third and fourth columns of Figure 4. The blue bounding boxes show the detected objects. The yellow boxes show the false detections and missed targets. Although the method in [20] performs better than the method in [14], false alarms and inaccurate detections occur frequently in both methods. The ground truth published on the web marks all the vehicles in the scene, including both moving and stationary vehicles.
Figure 5 shows the comparison results on the MunichCrossroad01 dataset. The characteristic of this dataset is that the false objects due to parallax error caused by trees and other elements occupy a large proportion of the image area. In Figure 5, the images from top to bottom show the detection results for the 1st, 7th, 13th and 18th frames. The results in the first column show that our approach can handle moving object detection in scenes with many trees and overcome the parallax error they cause. In contrast, the traditional registration-based detection methods [14,20] are affected by the trees and cannot estimate the scene model accurately. Therefore, as shown in the third and fourth columns of Figure 5, the detection rate of the traditional methods is low and their false alarm rate is high. Figure 6 shows the detection results on the MunichCrossroad02 dataset. This dataset includes many buildings. The traditional methods [14,20] cannot accurately estimate the background parameters or obtain correct detection and segmentation results in this situation. As shown in the third and fourth columns of Figure 6, many false alarms and missed detections occur. In contrast, the first and second columns show the detection and segmentation results of our approach. The results demonstrate that our approach can perceive the scene and detect moving objects correctly thanks to multiple background model estimation. As shown in Figures 4-6, our approach performs much better than the traditional registration-based detection algorithms. It can analyze the multiple background models in a scene and detect the moving objects accurately. However, since we adopt mean shift color segmentation and pyramid dense optical flow to perceive the multiple background models, the algorithm's efficiency still needs to be improved, and more efficient multiple background model estimation algorithms are required. Additionally, this paper focuses on vehicle-sized objects and cannot detect point-like objects such as humans. We also do not apply any special treatment to shadows, so the segmented moving objects may contain shadow; handling this is left for future work.
In order to quantitatively analyze the detection accuracy of this paper, we define the recall $R$, the precision $P$ and the comprehensive evaluation indicator $F_1$ as follows:

$$R = \frac{DObNum}{ObNum}, \qquad P = \frac{DObNum}{DNum}$$

$$F_1 = \frac{2PR}{P + R} \tag{24}$$

where $ObNum$, $DObNum$ and $DNum$ are the number of objects in the ground truth, the number of correct detections and the number of detected objects, respectively. If a detected object's overlap rate with a true object is above 0.5, it is considered a correct detection. Otherwise, it is a false alarm. In practical applications, higher $R$ and $P$ are desired, but these two indicators are contradictory in some cases. $F_1$ integrates the results of $R$ and $P$. A higher $F_1$ indicates that the method is more effective. Figure 7 shows the comparison results of our approach, the traditional method in [20] and the traditional method in [14]. The results from left to right are the statistical results on DARPA VIVID EgTest01, and on StuttgartCrossroad01, MunichCrossroad01 and MunichCrossroad02 of the KIT AIS Data Set. As shown in Figure 7a, all three algorithms achieve a high detection rate under a simple background, and their detection precisions are similar. However, the $F_1$ of our algorithm under complex backgrounds is higher than that of the methods in [14,20]. For example, on the StuttgartCrossroad01 dataset, the $F_1$ of our result is 0.949, which is higher than 0.808 for Shen [20] and 0.611 for the method in [14]. On the MunichCrossroad01 dataset, our approach's $F_1$ is 0.937, which is higher than 0.821 for Shen [20] and 0.625 for the method in [14]. These results show the significant superiority of our algorithm, as shown in Figure 7b-d.
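As a worked illustration of these indicators, the short Python sketch below computes recall, precision and $F_1$ from the three counts and checks one detection against the 0.5 overlap rule. The overlap rate is assumed here to be the intersection over union of axis-aligned boxes, which the text does not specify, and the counts and boxes are hypothetical values, not results from the paper.

```python
def detection_scores(ob_num, dob_num, d_num):
    """Recall, precision and F1 from ground-truth, correct and detected object counts."""
    recall = dob_num / ob_num
    precision = dob_num / d_num
    if precision + recall == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)

def overlap_rate(box_a, box_b):
    """Assumed overlap measure: intersection over union of (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# Hypothetical frame: 20 true objects, 19 detections, 18 of them correct.
recall, precision, f1 = detection_scores(ob_num=20, dob_num=18, d_num=19)

# A detection shifted by half its width overlaps too little to count as correct.
rate = overlap_rate((0, 0, 10, 10), (5, 0, 15, 10))
is_correct = rate > 0.5
```

Because $F_1$ is the harmonic mean of $P$ and $R$, it drops sharply when either indicator is low, which is why it is used as the summary measure.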

Conclusions
This paper addresses the moving object detection problem for aerial videos of complex scenes. We propose a novel moving object detection algorithm based on multi-model estimation and optimized classification. First, we calculate the dense optical flow of the scene and perform mean shift color segmentation to capture an overall perception of the scene. Second, we calculate affine transformation models, which serve as the multiple background models, for each color block with a large area. Through cross-validation and merging of the multiple background models, accurate multi-model parameters of the scene are obtained. Third, to obtain the multiple background segmentation of the scene, the background points are assigned to the multiple background models by an energy optimization method solved via Graph Cuts. Finally, we calculate the subordinate degree of the foreground regions to the multiple background models, remove the false alarms and segment the moving objects accurately.
Since we break through the single background constraint and adopt multiple background models, our algorithm can handle the moving object detection under complex multiple background scenarios. Moreover, our algorithm can segment the background and foreground regions accurately due to the adoption of Graph Cuts, optical flow information and continuous smooth constraints. The experimental results on many aerial videos indicate that our algorithm can correctly perceive multiple background information of the scene and detect moving object accurately in the complex scenes with multiple backgrounds, buildings and other objects that produce parallax.