A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs

Zhang, Zhewei; Jing, Tao; Ding, Bowen; Gao, Meilin; Li, Xuejing

doi:10.3390/app9132670

Open AccessArticle

A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs

by

Zhewei Zhang

^1,*,

Tao Jing

¹,

Bowen Ding

²,

Meilin Gao

¹ and

Xuejing Li

¹

Institution of Electronic Information Engineering, Beijing Jiaotong University, Shang Yuan Road No. 3, Haidian District, Beijing 100044, China

²

Artificial Intelligence Research Institute of China Unicom, Beijing 100048, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2019, 9(13), 2670; https://doi.org/10.3390/app9132670

Submission received: 25 March 2019 / Revised: 22 May 2019 / Accepted: 3 June 2019 / Published: 30 June 2019

(This article belongs to the Special Issue Advanced Intelligent Imaging Technology)

Download

Browse Figures

Versions Notes

Abstract

:

Detecting the Region of Interest (ROI) for video clips is a significant and useful technique both in video codecs and surveillance/monitor systems. In this paper, a new model-based detection method is designed which suits video compression codecs by proposing two models: an “inter” and “intra” model. The “inter” model exploits the motion information represented as blocks by global motion compensation approaches while the “intra” model extracts the objects details through objects filtering and image segmentation procedures. Finally, the detection results are formed through a new clustering with fine-tune approach from the “intra” model assisted with the “inter” model. Experimental results show that the proposed method fits well for real-time video codecs and it achieves a good performance both on detection precision and on computing time. In addition, the proposed method is versatile for a wide range of surveillance videos with different characteristics.

Keywords:

ROI detection; global motion detection & compensation; image segmentation; decision tree; camera status classification

1. Introduction

Detecting Region Of Interest (ROI) for video sequences is widely applied both in surveillance/monitor systems and in video encoders. In recent decades, people attempt to define the concept of ROI based on Human Visual System [1]. Also, an ROI-based video encoding mode is adopted both in previous research [2,3] and in latest video codecs such as VPX [4] or HEVC [5]. The notion of ROI is within an evolutionary process, In earlier works [6,7], researchers treat faces as ROI details and adjust quantization parameters (QP) in video compression for ROI encoded blocks. Other works such as [8,9] regard moving or high contrast image segments as ROI details for video codecs. As the concept of the “objected-based” video coding raised from MPEG-4 [10], researchers begin to focus on moving objects extraction and encoding. Hence, ROI is defined more likely as the foreground moving objects in video clips. During the recent five years, many fore-background separation methods are developed based on moving objects detection. The low-rank decomposition method is able to solve the change detection problem by outlier pursuit algorithm, in which the changing components are stored in a sparse matrix. Principle Component Analysis (PCA) has been proposed as a state-of-the-art method for low-rank matrix recovery [11]. However, the classical PCA can be easily corrupted by noise. With the rapid development of artificial intelligence, objects discovery is also used in navigation system designs for autonomous cars, in which computers would recognize objects and make the planning without specialized sensors [12]. Therefore, it is essential to design a detection algorithm for both accuracy and efficiency, so that typical objects in videos or images can be automatically discovered real time to help the computer understand their contents. In addition, object detection can be embedded into the scope of motion estimation (ME) for video coding architectures in our previous works [13,14,15]. Specifically, the detection method based on ME can be embedded into some ME processors. For example, the custom-instruction with the combination of synchronous dynamic random access memory (SDRAM) and on-chip memory architecture-based Nios II processor [16,17], and the FPGA architecture-based processor [18].

1.1. Related Work

One of the state-of-the-art methods for ROI detection is the Gaussian Mixture Model (GMM) [19,20,21], in which fore-background are modeled with some weighted mixture components. Another typical method is named as sparse and low-rank decomposition [11], in which moving objects are computed as the sparse matrix. Derived from this idea, principle component analysis (PCA) [22] is used to efficiently exploit the low-rank structure within errors or outliers, and it is well extended by RPCA [23,24,25] where the robustness is improved to work even in the presence of large errors. Also, a probabilistic model for RPCA is proposed in [26] which provides a good visual results. Other low-rank models are treated as the variations of RPCA such as Grouse [27], RASL [28] and SSGodec [29], in which the computing costs are optimized. Moreover, texture-based objects extraction is proposed in Chan et al. [30,31,32,33]. In these works, the authors cluster and extract dynamic textures by the HEM-DTM approach corresponding with Hidden Markov Model (HMM). In addition, background modeling-based approaches such as Vibe [34], LBP [35], SOBS [36], etc. have introduced different background-model update policies based on time, color-distance, and some reliable factors. Meanwhile, the machine learning approaches such as unsupervised clustering [37,38], Markov Random Field (MRF) for graph cut [39,40,41,42] also achieve a good performance on ROI extraction and detection.

The other parts of research for ROI detection methods focus on deep-learning based method. Architectures of deep neural network with repeated convolution and pooling layers have become the dominant idea for high-quality objects recognition and detection [43]. Szegedy et al. [44] introduced the idea of bounding box masks regression for object detection. Similar works on MultiBox [45,46], improved the previous work by enhancing efficiency and scalability. Ren et al. [47] improved the architecture of R-CNN by proposing a Region Propsal Network (RPN) that shares full-image convolutional features with the detection network. He et al. [48] designed a framework of Spatial Pyramid Pooling (SPP) which enhances robustness to size variations in the network. The paper also claimed that the detection speed was improved compared to R-CNN. YOLO by Redmon et al. [49] used a single network to predict bounding boxes and class labels for objects. Hence, the detection speed has a significant breakthrough. Later, Single Shot Multibox Detector (SSD) [50] by Liu et al. discretized the output space of bounding boxes into a set of default boxes over different resolutions per feature map location. Both the detection speed and the accuracy have been improved since it only uses a single deep neural network with multiple resolutions of feature maps for boxes regression. Other R-CNN based methods [51,52,53] attempted to optimize the network structure by introducing dense connections, reusing computations, applying per-region subnetwork structures to improve the detection performance. However, the multi-stage pipeline training framework of R-CNN brings expensive training costs in space and time. Meanwhile, features extracted from each object proposal are slow at test time [54]. As for machine learning-based methods, a state-of-the-art motivation is to detect objects by using pattern mining with a dictionary model [55]. Our previous work [56] proposed a word-topic model based on Latent Dirichlet Allocation (LDA) model, in which the image features are transformed into corresponding words for ROI object description. Also, an Anchor-box fine-tuning algorithm is proposed to improve the detection precision. It makes a trade-off between the computing costs and model accuracy.

1.2. Overview and Contributions

According to the related research listed, a motivation is to propose a method which suits video codecs like HEVC/VPX. In this paper, a new ROI modeling-based detection scheme is designed which copes with video compression standards by proposing two ROI models: “inter” and “intra”. The “inter” part includes camera status detection and global motion compensation (GMC). The “intra” part includes image segmentation and objects filtering. At last, the detection results are combined together based on the output of two models. Please note that each model can only observe a partial information: The “inter” corresponds with motion information and the “intra” with objects information, as depicted in Figure 1. Previous works [57,58,59,60] introduce an adaptive feature-based methods (FBMs) which estimates the camera motion with error-least matching points assisted by the Single Homographic (SH) matrix. However, minimizing the movement errors of matching points brings computational costs that reduce the speed in real-time systems. To overcome this disadvantage, we propose a fast and approximate global motion compensation way which uses a flexible sampling set and a decision tree model for classification in the “inter” model. The motion vector blocks for objects are acquired via global motion compensation techniques. As for the “intra” model, a Kernel Fuzzy-C Means image segmentation approach is proposed to cluster pixels into specific groups. Then, we use a new motion area pruning method to eliminate the “noise” motion blocks extracted through the “inter” model. Please note that the purpose of “intra” model is to cluster objects by colorspaces and combine them with the motion information separately owing to the connectivity correlation of each cluster.

The contributions of this paper are illustrated below: First, we address the ROI detection problem by designing the “inter” and “intra” models. In the inter model, we first propose a sampling criteria which accurately samples the background motion blocks. Then, a fast and approximate way for global motion detection and compensation techniques based on the camera status and the decision tree model are proposed. In the intra model, a new adaptive and accurate image segmentation method is designed for clustering foreground objects. We innovate combining two models as the finial detection results by developing a combination algorithm based on color correlation and distance. A motion area pruning algorithm is proposed to eliminate noise motion blocks extracted via GMC. Besides, we test our method on different open datasets. Performance evaluations on both the change detection dataset [61] and official HEVC/AVC test sequences [62] shows our method performs well in terms of detection precision and computing time. The rest of this paper is organized as follows. In Section 2, we elaborate the details of the two proposed models of ROI and the detection algorithm. In Section 3, experimental results are displayed with sufficient analysis for visual effect comparisons and performance. Finally, conclusion is drawn in Section 4.

2. Detection Model

2.1. Foreground Motion Blocks Extraction

2.1.1. Sampling Areas Determination

We propose a sample-area criteria for a single frame illustrated in Figure 2. There are eight sample areas in a frame, namely,

X_{1}, X_{2}, \dots X_{8}

, which are located at four diagonal positions and four middle positions in the boundary. Each sample area contains twenty-five

8 \times 8

blocks whose types include both inter and intra, and the size of each sample area is

40 \times 40

. The two green rectangles in Figure 2 are represented as the outer and the inner boundary respectively, and the center point in this frame is named as the anchor point. For mathematical analysis, we define sampling set

X = {X_{1}, X_{2}, \dots X_{8}}

, and each block in

X_{i} (1 \leq i \leq 8)

is represented as

x_{i 1}^{i} (1 \leq i 1 \leq 25)

. Also, we define

I_{i}

as the intercoded block set and set

I_{i}^{C}

be the intracoded block set for sampling area

X_{i}

. Consider the motion vectors (MVs) for the filtered sampling blocks of the previous frames has a distribution named

q (x)

, and the current unfiltered sampling blocks has a distribution named

p (x)

, we can compute Kullback-Leivler Divergence (KLD) between them as:

D (p ∥ q) = E_{x \sim P} [log \frac{P (x)}{Q (x)}] = \int p (x) log \frac{p (x)}{q (x)} d x,

(1)

in which

E_{x \sim P} [\cdot]

is a statistical average value. To minimize

D (p ∥ q)

, a trick is to quantize these MVs into

X

into l levels. Let

m v_{j}^{i} (j \in I_{i})

denote the jth block’s MV in

I_{i}

, the maximum and minimum MVs are represented as:

\{\begin{matrix} m v_{m i n} (x) = min_{⋃ I_{i}} (min_{j \in I_{i}} (m v_{j}^{i} (x))); m v_{m i n} (y) = min_{⋃ I_{i}} (min_{j \in I_{i}} (m v_{j}^{i} (y))) \\ m v_{m a x} (x) = max_{⋃ I_{i}} (max_{j \in I_{i}} (m v_{j}^{i} (x))); m v_{m a x} (y) = max_{⋃ I_{i}} (max_{j \in I_{i}} (m v_{j}^{i} (y))) \end{matrix}

(2)

m v (x)

and

m v (y)

stand for the horizontal and vertical components. The MV step

Δ_{s}

equals to

∥ m v_{m a x} - m v_{m i n} ∥ / l

, where

∥ \cdot ∥

is the

L_{2}

norm. Let

r_{s}

be the sth level of the quantized MVs, its probability has the following expression:

P_{r} (r_{s}) = \frac{n_{s}}{n}, 1 \leq s \leq l .

(3)

Here,

n_{s}

means the number of MVs at level s and n stands for the total quantized MVs. Then, (1) can be rewritten in a discrete way:

D (P_{r} ∥ P_{r}^{'}) = \sum_{s} P_{r} (r_{s}) log \frac{P_{r} (r_{s})}{P_{r}^{'} (r_{s})},

(4)

in which

P_{r}^{'}

is the historical MV distribution for previous frames. Now, we elaborate the proposed sampling block filter algorithm (Algorithm 1):

Algorithm 1 Sampling block filter algorithm

1:: Initialize sample set $X = \emptyset$ , $κ = 1$ , l, the expected sample size $S_{z}$ , and the decay factor $γ = 0.9$ .
2:: while $| X | < S_{z}$ do
3:: Reset $X = \emptyset$ .
4:: for $s = 1, \dots, l$ do
5:: Compute the relative deviation $d = | \frac{∥ P_{r} (r_{s}) - P_{r}^{'} (r_{s}) ∥}{P_{r} (r_{s})} |$ .
6:: for each intercoded block $x_{j}^{i}$ whose MV belongs to level k do
7:: Add $x_{j}^{i}$ into $X$ with an acceptance probability $P_{a} = 1 - κ d$ .
8:: end for
9:: end for
10:: Decrease $κ$ by $κ = γ κ$ .
11:: end while
12:: Update $P_{r}^{'} (r_{s})$ by rules in (5).

Please note that the algorithm first calculates the relative deviation d between

P_{r} (r_{s})

and

P_{r}^{'} (r_{s})

at each level and accept each block with a probability according to

δ

. Next, it maintains the lower bound of the sampling size by reducing

κ

during each loop. Finally, it updates

P_{r}^{'} (r_{s})

by the following rule:

P_{r}^{'} (r_{s}) = \frac{\sum_{i} \sum_{j \in I_{i}} δ (x_{j}^{i} \notin I_{i}^{'}) δ (Q (m v_{j}^{i}) = r_{s}) + n_{s}^{'}}{\sum_{i} \sum_{j \in I_{i}} δ (x_{j}^{i} \notin I_{i}^{'}) + n_{s}^{'}} .

(5)

where

δ (x)

is the indicator function:

δ (x) = x \equiv True ? 1 : 0

,

I_{i}^{'}

is the previous sampled intercoded block set of the ith area,

n_{s}^{'}

stands for the previous number of sampled blocks, and

Q (m v)

represents the quantized level of a MV. Please note that the updating rule recalculates

P_{r}^{'} (r_{s})

by updating the previous and current sampling sets with different sampling blocks. At the first P-frame, all the intercoded blocks in

X_{i}

are added into sampling set

X

. Then, the filter algorithm reduces the “error” blocks via an updating process to get the correct sampling blocks for global motion detection.

2.1.2. Camera Status Detection

The global motion estimation (GME) method handles for these situations specifically: stable, camera panning, camera zoom in/out and others. A status assemble

H = {C_{s}, C_{p}, C_{z i}, C_{z o}, C^{'}}

is defined according to the listed situations. Each camera state and its decision model are demonstrated in details:

Panning
To determine Camera Panning, an intuitive approach is to compare the norm and direction of MVs among different sampling areas. For each area $X_{i}$ , its average MV value is computed as follows:

$\{\begin{matrix} \bar{m v_{i}} (x) & = \frac{1}{| I_{i} |} \sum_{j \in I_{i}} m v_{j} (x) \\ \bar{m v_{i}} (y) & = \frac{1}{| I_{i} |} \sum_{j \in I_{i}} m v_{j} (y) \end{matrix}$

(6)

We define MV set $MV (X) = {\bar{m v_{1}}, \dots, \bar{m v_{8}}}$ . For normalization, we have:

$\{\begin{matrix} \bar{m v_{i}} (x) & = \bar{m v_{i}} (x) / (m v_{m a x} (x) - m v_{m i n} (x)) \\ \bar{m v_{i}} (y) & = \bar{m v_{i}} (y) / (m v_{m a x} (y) - m v_{m i n} (y)) \end{matrix}$

(7)

and their variance without the minimum and maximum values are listed below:

$\begin{matrix} \{\begin{matrix} V a r (m v (x)) = \frac{1}{N} \sum_{i \in MV (X)} {(\bar{m v_{i}} (x) - \bar{m v} (x))}^{2} \\ V a r (m v (y)) = \frac{1}{N} \sum_{i \in MV (X)} {(\bar{m v_{i}} (y) - \bar{m v} (x))}^{2} \end{matrix} \end{matrix}$

(8)

Please note that $\bar{m v} (\cdot)$ denotes the average values of $\bar{m v_{i}} (\cdot)$ , and N equals to the size of $MV (X)$ . Meanwhile, the angle components of $MV (X)$ are taken into consideration as well. Let $θ_{i}$ be the angle component for $\bar{m v_{i}} (x)$ :

$θ_{i} = arctan \frac{| \bar{m v_{i}} (x) |}{| \bar{m v_{i}} (y) |},$

(9)

and its variance is followed by:

$V a r (θ) = \frac{1}{N} \sum_{i} {(θ_{i} - \bar{θ})}^{2} (1 \leq i \leq N) .$

(10)

Then, a score parameter $S_{c p}$ for judging camera panning is formed as:

$S_{c p} = [1 - \frac{1}{2} (V a r (m v (x)) + V a r (m v (y)))] {| ln (1 - V a r (θ)) + 1 |}^{- 1} .$

(11)

$S_{c p}$ is within a range of (0,1) since normalization process is done before calculation, and $ln (1 - V a r (θ))$ is treated as the penalty function. When $V a r (θ) \to 1$ , $S_{c p} ≃ 0$ since we attempt to put more weight on the variance of $θ$ to decide the camera panning circumstance.
Zoom-In and Out:
In this status, each sampling area has a different direction of motion vector. An illustration for the zooming model is displayed in Figure 3.
Obviously, the directions of motion vectors for these sampling areas can be regarded as a factor to determine the zooming in and out situation. Also, the norm of motion vectors is relevant with the distance between the anchor point and each center of the character block. To be specific, we categorize $X$ into two groups: $X_{d i a g} = {X_{1}, X_{3}, X_{5}, X_{7}}$ and $X_{m i d} = {X_{2}, X_{4}, X_{6}, X_{8}}$ . Also, define a set of vectors $F = {{\vec{f}}_{1}, {\vec{f}}_{2}, \dots, {\vec{f}}_{N}}$ that:

${\vec{f}}_{i} = (X_{o} - X_{i} (x), Y_{o} - X_{i} (y)) X_{i} \in X,$

(12)

where $(X_{o}, Y_{o})$ is the coordinate of the anchor point o, and $(X_{i} (x), X_{i} (y))$ is the mean coordinate of $X_{i}$ . For camera zoom-in status, the inner product of $MV (X)$ and $\vec{f}$ is a positive value while it is negative in zoom-out status. We can compute the offset angle $Ω$ for zoom in and out by:

$Ω_{i} = arccos \frac{\bar{m v_{i}} \cdot {\vec{f}}_{i}}{∣ \bar{m v_{i}} ∣ ∣ {\vec{f}}_{i} ∣} .$

(13)

Apparently, $Ω_{i} > 0$ means $X_{i}$ is in the direction of zoom-in, otherwise zoom-out. We keep $Ω$ in a belief range $[Ω_{l}, Ω_{r}]$ , in which the max difference between $Ω_{r}$ and $Ω_{l}$ is set assigned with a constant value (marked in Figure 3 as gray solid arcs). Suppose $Ω$ obeys the normal distribution: $Ω \sim N (μ, σ)$ . Then, the likelihood factor $L$ is given as:

$\{\begin{matrix} L (Ω) & = 2 π σ^{- \frac{1}{2}} exp (\frac{{(Ω - μ)}^{2}}{2 σ^{2}}) \\ \tilde{L} (Ω) & = L (Ω) / L (μ) . \end{matrix}$

Please note that $\tilde{L} (Ω)$ is normalized to unify the range. Besides, a ’hit and miss’ voting scheme is induced: for the zoom-in situation, if $Ω_{i} > σ$ , we say that sampling area i misses the zoom-in contribution, otherwise hit; for the zoom-out situation, if $π - Ω_{i} > σ$ , we say miss, otherwise hit. Now, the voting rate for zooming situation is declared as:

$V (X) = \frac{1}{N} \sum_{i} δ (Ω_{i}, σ) 1 \leq i \leq N$

(14)

where $δ (Ω_{i}, σ)$ is also an indicator function:

$δ (θ_{i}, σ) = \{\begin{matrix} 1 & Ω_{i} < σ a n d \bar{m v_{i}} \cdot {\vec{f}}_{i} > 0 \\ - 1 & π - Ω_{i} < σ a n d \bar{m v_{i}} (x) \cdot {\vec{f}}_{i} < 0 \\ 0 & o t h e r w i s e . \end{matrix}$

(15)

When $| V (X_{i}) | \geq p$ , the voting rate’s error is given as $1 - p$ . It can be seen that the sampling areas with opposite motion directions show more negative contributions on $V (X)$ than ones with higher variances, which is reasonable since motion direction plays an important role in zoom judging. Another factor that influences zoom judging is the norm of $MV (X)$ , it satisfies:

$E (| \bar{MV (X_{d i a g})} |) ≃ \frac{h}{\sqrt{w^{2} + h^{2}}} E (| \bar{MV (X_{m i d})} |),$

(16)

where w and h are width and height of a frame, $| \bar{MV (X_{d i a g})} |$ stands for the average norm of MVs in $X_{d i a g}$ . Observe that this equation is the approximate one, and its relative error $ε$ can be defined as:

$ε = \frac{2 | E (| \bar{MV (X_{d i a g})} |) - \frac{h}{\sqrt{w^{2} + h^{2}}} E (| \bar{MV (X_{m i d})} |) |}{E (| \bar{MV (X_{d i a g})} |) + \frac{h}{\sqrt{w^{2} + h^{2}}} E (| \bar{MV (X_{m i d})} |)}$

(17)

Finally, the score for deciding camera zooming is given below:

$S_{z} = \frac{(1 - ε) V (X)}{N} \sum_{i} \tilde{L} (Ω) (1 \leq i \leq N)$

(18)

When $S_{z}$ is negative, it is inclined to zoom-out situation, otherwise zoom-in. In addition, we have to set a threshold $t_{h}$ that must satisfy $| S_{z} | > t_{h}$ for the zooming circumstances.
Stable and Others:
For a stable situation, it is easy to judge by checking whether $E (| V (X) |) ≃ 0$ or $V a r (| MV (X |) ≃ 0$ . Besides, the sampling set $X$ contains most of intracoded blocks ( $| I^{C} | ≫ | I |$ ), which indicates the stable situation as well. The last circumstances $C^{'}$ is named as “rest” when it does not satisfy any of the above circumstances. Please note that $C^{'}$ may consist some important situations that we have ignored in previous detection steps. For example, the combination of panning and zooming, which occurs frequently in video recording. It is hard to decompose both panning and zooming motion vector component since the panning component can have any directions and norms. An alternative way is to simplify this problem by extracting the horizontal and vertical panning motion vectors from the composed one separately. We consider the upper-right corner sampling area in Figure 4 for instance, solid lines represent the extracted motion vectors, and they can be forced to decompose into the zoom-in and the retrieved horizontal/vertical components (dotted lines and red solid lines). The norm of the zoom-in component is set as a fixed value to make sure the retrieved motion vector is horizontal or vertical. We re-compare the retrieved horizontal and vertical components for the sampling areas using the camera panning model; if it satisfies the conditions, the status of camera is both panning and zooming. On the other hand, the zoom-out situation follows the same rule as the zoom-in.

We propose a decision tree model is designed for global motion detection, as depicted in Figure 5.

Also, a related algorithm is written in Algorithm 2.

Algorithm 2 Global motion decision algorithm

1:: if ( $E (| V (X) |) ≃ 0$ and $V a r (| V (X) |) ≃ 0$ ) or $| I^{C} | ≫ | I |$ , return $C_{s}$ .
2:: Compute $S_{c p}$ using (11), if $S_{c p} > t_{h 1}$ , return $C_{p}$ .
3:: Compute $S_{z}$ using (18), if $| S_{z} | > t_{h 2}$ and $S_{z} > 0$ return $C_{z i}$ .
4:: else if $| S_{z} | > t_{h 2}$ and $S_{z} < 0$ return $C_{z o}$ .
5:: else return $C^{'}$ .

The tree is mainly classified into three categories: stable, panning, zooming, and they are ordered by occurrence probabilities. Threshold

t_{h 1}

and

t_{h 2}

are induced to control the detection precision.

2.1.3. Global Motion Compensation

A global motion compensation procedure is recommended for each block to retrieve the foreground MVs. Since the coding unit (CU) has different resolutions both in VPX and in HEVC, it is recommended to unify the block size for processing as

8 \times 8

. For each block

b_{i}

in a frame, its compensated motion vector is computed through:

m v_{r} (b_{i}) = m v (b_{i}) - m v_{g} (b_{i}),

(19)

where

m v_{r} (b_{i})

and

m v_{g} (b_{i})

denote the real and global motion vector respectively. The key problem is to compute

m v_{g}

for each block

b_{i}

in different camera circumstances. Consider each element

h \in H

, for

h = C_{s}

,

m v_{r} (b_{i}) = m v (b_{i})

; For

h = C_{p}

,

m v_{g} (b_{i}) = E (MV (X))

; For

h = C_{z o}

and

h = C_{z i}

, according to Figure 3, it might be asserted that

m v_{g} (b_{i})

and

\vec{f_{b i}}

have the same or opposite directions, in other words,

\frac{m v_{g} (b_{i}) \cdot \vec{f_{b i}}}{| m v_{g} (b_{i}) | | \vec{f_{b i}} |} ≃ \pm 1

. When

h = C_{z i}

,

m v_{g} (b_{i}) = α \vec{f_{b i}}

, and when

h = C_{z o}

,

m v_{g} (b_{i}) = - α \vec{f_{b i}}

.

α

denotes the scaling factor, which has the following expression:

α = \frac{1}{N} \sum_{i = 0}^{N} \frac{| \bar{m v_{i}} |}{| {\vec{f}}_{i} |} .

(20)

Last but not least, if

h = C^{'}

, we check whether the combination of panning and zooming occurs first, if so,

m v_{g} (b_{i})

is composed of the two motion vectors and

m v_{r} (b_{i})

can be acquired by (19) and (20). Otherwise, it is recommended to check out the previous result of h in the reference frame buffer for the current one. If the current one is I frame (has no reference frame), or

h_{r e f} = C^{'}

without combining the cases of panning and zooming, we have to ignore

m v_{g} (b_{i})

and leave it as zero since lacking of global motion information. If

h_{r e f} \neq C^{'}

, we use

m v_{g} {(b_{i})}^{'}

of the reference frame instead of this one as the last choice.

2.1.4. Discussion

Figure 6 displays some extracted motion blocks for ROI foreground objects via different approaches. Each method is introduced in brief:

Gradient-Descent (GD) [63]: It aims to compute the gradient by Newton-Raphson method for the error distance between the true and the estimated MVs, then it updates parameters based on the gradient descend direction so that the error distance is reduced.
Least-Sum-Square M-Estimator (LSS-ME) [64]: It involves formulating the error distance by an over-determined regression: $A^{T} A m = A^{T} b$ , where $A \in R^{2 N \times 8}$ , $b$ is a $2 N$ vector of transformed coordinates $(x^{'}, y^{'})$ , and N is the total number of MVs. It also uses iterative outlier rejection through a robust M-Estimator to estimate motion components $m$ .
RANSAC [65]: This method induces a statistical method for GME in different computer vision problems. It computes the homography matrix via a large number of iterations to achieve the maximum transformed matching points between the current and the reference frames.

One can see that the proposed method retrieves the foreground MV blocks more accurately than others by eliminating “noisy” background blocks. Instead of using the regression (GD, LSS-ME) or statistical (RANSAC) methods to estimate the global motion parameter

m

, the proposed decision tree model is efficient on judging the global camera status. The average running time (sec/frame) performances are tested on OpenCV (C++) with a 2.8GHZ Core i7 CPU. Table 1 displays that the proposed method achieves a better result in time performance since high computation costs of matrix SVD and iterations are substituted by the tree model with

O (n)

complexity.

To justify how these thresholds influence the classifier’s performance, several sequences are collected for test with different characteristics. These sequences are pre-analyzed by the photographer’s comments for status judgements, and any decision result which obeys the objective comments is treated as the misjudgement. For performance evaluation purpose, we extract partial frames in each sequence to ensure that each sequence has at least two camera states. Frames with main status are regarded as positive samples while others are negative samples. Without loss of generality, we select the number of negative samples nearly the same as the positive ones. For example, in foreman, frames between 47–81 are commented as the stable state and 82–115 are commented as the panning state. Table 2 displays some performance indexes (e.g., recall, precision, F-score) for different settings of threshold

t h 1

and

t h 2

. The panning and zooming groups are tested with six sequences, and the results are acquired by computing their average values. One can see that an optimal combination of (

t h 1

,

t h 2

) is (0.3, 0.5), which indicates that the classifier is able to make acceptable decisions for camera statuses.

2.1.5. Marking Motion Areas

After acquiring the global compensated motion data of foreground ROI objects, a MV clustering method is proposed to segment blocks into motion areas. For each ROI block, we apply the target clustering function J based on the following definition:

J = \sum_{i = 1}^{N} \sum_{j = 1}^{K} μ_{i j} (∥ {\tilde{x}}_{i}^{p} - {\tilde{c}}_{j}^{p} ∥^{2} + ∥ {\tilde{x}}_{i}^{m v} - {\tilde{c}}_{j}^{m v} ∥^{2} + ∥ {\tilde{x}}_{i}^{I} - {\tilde{c}}_{j}^{I} ∥^{2}),

(21)

where

{\tilde{x}}_{i}^{p}

,

{\tilde{x}}_{i}^{m v}

and

{\tilde{x}}_{i}^{I}

denote the normalized vectors of position, motion vector, pixel intensity for block i correspondingly. Please note that

{\tilde{x}}_{i}^{I}

has three components (Y/U/V) while the others have two.

μ_{i j}

indicates the belongingness of block i to cluster area j, and c stands for the center vector. Apparently, the target function J considers blocks’ position together with pixel and motion similarities. To solve non-spherical clusters, we use Kernel Fuzzy C-Means (KFCM) method for block classification, which induces the kernel function

K ({\tilde{x}}_{i}, {\tilde{c}}_{j}) = exp (- \frac{∥ {\tilde{x}}_{i} - {\tilde{c}}_{j} ∥^{2}}{σ^{2}})

in criteria computation. To adaptively classify these foreground blocks, we propose a clustering algorithm which optimally classifies blocks into adequate areas, see Algorithm 3 for details.

Algorithm 3 KFCM Motion Areas Clustering Algorithm

1:: Set $k = 1$ , ${\tilde{c}}_{1} = E ({\tilde{x}}_{i}^{p}, {\tilde{x}}_{i}^{m v}, {\tilde{x}}_{i}^{I})$ .
2:: while $k \geq K_{m a x}$ do
3:: Segment N blocks into k clusters by KFCM algorithm with the stopping criteria: $\frac{| J^{t + 1} - J^{t} |}{J^{t}} < ϵ$ .
4:: Compute $J_{j} = \sum_{i = 1}^{| {\tilde{c}}_{j} |} (∥ {\tilde{x}}_{i}^{p} - {\tilde{c}}_{j}^{p} ∥^{2} + ∥ {\tilde{x}}_{i}^{m v} - {\tilde{c}}_{j}^{m v} ∥^{2} + ∥ {\tilde{x}}_{i}^{I} - {\tilde{c}}_{j}^{I} ∥^{2})$ .
5:: if ${max}_{j \in (0, k]} J_{j} < ε | {\tilde{c}}_{j} |$ and $| {\tilde{c}}_{j} | \geq N / (k + o)$ then
6:: return $μ$ , k, ${\tilde{c}}_{j}, j \in (0, k]$ .
7:: end if
8:: $k = k + 1$ ;
9:: end while
10:: return $μ$ , $K_{m a x}$ , ${\tilde{c}}_{j}, j \in (0, K_{m a x}]$ .

Notice that the clustering algorithm segments the foreground blocks into k clusters, where k increases from 1 to

K_{m a x}

iteratively. At each turn, it computes the target function (

J_{j}

) for each cluster j. When each cluster’s size satisfies that

| {\tilde{c}}_{j} | \geq N / (k + o)

(o is an offset, empirically set as 2) and

J_{j}

is constrained on a upper bound

ε | {\tilde{c}}_{j} |

, the algorithm stops and returns the optimal cluster labels.

To specify the parameter influences on segmentation results, we set

K_{m a x} = 4

and change

ε

from 0 to 1. The clustering results are displayed in Figure 7, whereas three sequences are included for comparisons. We find that when

ε \geq 0.25

, the algorithm outputs two clusters; as

ε

decreases, more clusters are generated with margin details. Next, we collect the margin points for each cluster. Let us assume that

M_{i} = {b_{t}, b_{l}, b_{r}, b_{b}}

, in which collection

M_{i}

contains the top, left, right and bottom block for cluster i. Then, a rectangle area can be drawn based on these blocks. However, some “noise” blocks may reduce the precision of determining

M_{i}

. For example, in Figure 8c, the rectangle area covers many blank areas that are useless. To solve this problem, we propose a motion area pruning algorithm, see Algorithm 4 for details. Please note that the algorithm first traverses each block cluster from its margin points in a depth-first way. The searching procedure terminates when it reaches the bounds or the visiting block is not ROI block (

R^{n}

). Then, the traversed data is collected into subset

B

. If the size of

B

is small enough, the blocks in

B

are ignored as “noise” ones. Finally, the algorithm returns the pruned block set

B_{1}

. Please note that the search range r and step s are set as five and two block-distance empirically, in which larger value of r would eliminate less “noise” blocks. Figure 8 shows the boundaries of motion areas before and after the pruning algorithm. We can observe that the proposed method reduces the “noise” blocks effectively and obtains the precise rectangle.

Algorithm 4 Motion Area Pruning Algorithm

S e a r c h (b, V i s, B)

:

Input:b: The current processing block;

V i s

: Whether the block is visited;

B

: Temp block lists; Search range r and step s.

1:: If $C h e c k B o u n d (b) = =$ false or $V i s (b)$ or $T y p e (b) = = R^{n}$ , return.
2:: $V i s (b)$ =true, append b to B.
3:: for $i = 1$ ; $i < r$ ; $i + = s$ do
4:: $S e a r c h (b (x \pm i, y \pm i), V i s, B)$ .
5:: end for

P r u n e (M)

:

Input: : Margin block set

M

; Original block set

B_{0}^{i}

. Pruned block set

B_{1}^{i} = \emptyset

(

0 < i \leq k

).

1:: for each $M_{i}$ in $M$ do
2:: for each $b_{j}$ in $M_{i}$ do
3:: $S e a r c h (b_{j}, V i s, B_{j})$ .
4:: If $| B_{j} | > \frac{B_{0}^{i}}{| M_{i} | + o}$ , $B_{1}^{i} ⋃ = B_{j}$ .
5:: end for
6:: end for
7:: return $⋃ B_{1}^{i}$ .

2.1.6. Mask Drawing and ROI Extracting

After obtaining the boundaries of motion areas for each cluster, we consider drawing the margin masks for the motion areas. First of all, the “edge” blocks of each cluster are filtered by edge detection methods (i.e., Canny, Sobel). Let us assume that

B_{E}^{i}

is the set of “edge” blocks of the ith cluster, the shape of mask is then generated by the Gift-Wrapping algorithm in [66], see Figure 9 for details.

Please note that the Gift-Wrapping algorithm performs as a convex hull scan method, in which the convex shape of the motion area is clearly formed. For alternating the cover area, we define the center of motion area i as:

C_{m}^{i} (r, c) = \frac{1}{| B_{E}^{i} |} \sum_{b_{E} \in B_{E}^{i}} b_{E} (r, c),

(22)

where

C_{m}^{i} (r, c)

denotes the center coordinate (

(r, c)

means row and column). We also declare an inflation factor

ι

which controls the scaling extent of the mask. Given a fixed

ι

, the updated “edge” blocks

b_{E}^{'} (r, c)

can be expressed as:

\{\begin{matrix} b_{E}^{'} (r) & = b_{E} (r) + (ι - 1) \cdot (b_{E} (r) - C_{m} (r)) \\ b_{E}^{'} (c) & = b_{E} (c) + (ι - 1) \cdot (b_{E} (c) - C_{m} (c)) \end{matrix}

(23)

To specify how

ι

influences the mask shape, we change

ι

from 1 to 1.4, the visual effects of which is displayed in Figure 9c,d. After acquiring the mask data, we can determine the result of foreground objects inside the mask. We use the extracting method in our previous work [14], it implements a multidimensional segmentation method with the Expectation Maximization (EM) algorithm to segment the mask data into an optimal number of clusters. Then, for each cluster inside the mask, it computes the KLD (

D (p_{b} ∥ p_{f})

) between the background pixel histogram

p_{b} (x)

and foreground pixel histogram

p_{f} (x)

. When

D (p_{b} ∥ p_{f})

is below a threshold

ξ_{x}

, the current cluster is treated as the background cluster and it is removed from the mask. At the end of each iteration,

p_{b}

is updated based on the removed clusters and sampling background blocks, see Formula 25 in our previous work [14] for details.

3. Experiments

3.1. Visual Effect Analysis Experiments

3.1.1. Parameters Settings

It is essential to set some parameters before performing the visual analysis experiment. For ROI inter model, two threshold

t h 1

and

t h 2

are set to 0.5 empirically since the score is normalized, and the norm of the real motion vector should satisfy

∥ m v_{b_{i}} ∥ > \sqrt{2}

as a condition to determine whether block

b_{i}

belongs to motion mask

M

. For ROI intra model, the initial value of

ϑ (τ, μ, Σ)

is set manually instead of a random value for reducing iteration steps. Practically,

τ

is set initially as

{0.5, 0.5}

for

K = 2

classes, and

μ

,

Σ

are recommend to be acquired by running the K-means cluster (

K = 2

) method first and use the outputs as their initial values.

z_{i}

is assigned by an uniform distributed binary array

{0, 1, 0, 1, \dots}

with the same probability for each value.

β

is set as 1.5 and

ξ_{1}, ξ_{2}

are initiated as 8 and 0.12 respectively. Besides, the max-iteration steps for EM algorithm are constrained to 10 for reducing the computing costs.

3.1.2. Visual Comparisons on CDNet Dataset

In this sub-experiment, the performance of our method is evaluated by using five video clips from the change detection data set (CDNet) [61]. These clips contain: Shade (periodic motion, shadow), Office (slow-moving), Highway (fast-moving) Winter (camouflage), and Canoe (dynamic background). They are displayed from top to bottom in Figure 10. Seven methods are included for visual comparisons, and they are classified into several categories: (1) Matrix completion approaches: Decolor [67], PCP [22], GROUSE [27]; (2) Probabilistic modeling: GMM [21], VBRPCA [26]; (3) Background modeling: Vibe [34], MKFC [38]. For fast-moving and slow-moving clips, all of these methods are able to reveal ROI details (moving objects) while some noise pixels especially in PCP [22] and GROUSE [27]. When it faces to the shadow or dynamic background source, most of these methods fail because it is hard to handle backgrounds with varying ambient variations. The proposed method shows better results since it ignores ambient variations by object filtering. When processing some sequences like Winter, the proposed method fades in ROI details a little bit since the target contains too small pixels for

M

to retrieve, but it still meet a satisfactory result compared with other approaches.

3.1.3. Visual Comparisons On AVC/HEVC Sequences

In this sub-experiment, we test the proposed method among real AVC/HEVC (H.264/H.265) video sample sequences by comparing with other approaches. All the methods are performed based on the standard video compression software named HM13.0 for HEVC. The tested sequence are introduced with a raster order in Figure 11a correspondingly: BQSquare (zooming, slowly move), Carphone (dynamic background), Container (slowly and fast move), Ice (low-contrast fore-background move), KristenAndSara (slowly high definition move), Mobile (zooming, low-contrast fore-background move), Racehorse (large ROI fast move), Stefan (panning, fast move). For comparison purpose, two well-suited for real-time processing embedded in video codec methods are introduced. They are SSGoDec [29] and

Σ - Δ

motion detection [68], where the former uses outlier pursuit (OP) for low-rank matrix estimation while the latter consists a simple non-linear recursive approximation for background image based on an elementary increment/decrement value. The visual effect result is displayed in Figure 11, where the motion vector output

M

(represented as blocks) is displayed in Figure 11b after global motion detection and compensation. One can see that the proposed method achieves better performance especially on sequences with global motion (i.e., Mobile, Stefan) than others.

Σ - Δ

can extract an explicit margin details specifically in low-contrast fore-background sequences (i.e., Ice), however, it also brings stable details such as the barricades. Our method is able to provide similar object details as

Σ - Δ

and it also focuses on moving objects with stable details neglected. Similar result can be found in Racehorse.

3.1.4. Miscellaneous Analysis

The performance of ROI detection and the computing time for different approaches are analyzed in this sub-experiment. A state-of-the-art measurement for the performance analysis is to plot the ROC curve and compute the area under the curve (AUC). The collected data for test includes the ROI data (white pixels) of the ground truth together with the same amount of the non-ROI data (black pixels), to make sure that we have the same positive and negative samples. The label of classifier is marked as 1 if it belongs to ROI and 0 if not. Unfamiliar with visual effect comparisons, the outputs

I_{O}

of these detection methods are gray-scale image (0-255) without binarization by threshold

t h

, and the tested data is sorted by a score metric

\hat{s} (x)

in a descend order. Also, this sub-experiment addressed some performance metrics analyzes including recall, false positive rate (FPR), false negative rate (FNR), precision (Prec), and F-measure, whose definitions are given by:

\{\begin{matrix} Recall & = tp / (tp + fn); Prec = tp / (fp + tp); \\ FNR & = fn / (fn + tp); FPR = fp / (fp + tn); \\ F - measure & = (2 \times Recall \times Prec) / (Recall + Prec) \end{matrix}

(24)

where tp, fp, tn, fn denote true positive, false positive, true negative, false negative rate respectively. To begin ROC analysis, we define set

T e

as the set of test examples (detection outputs);

T e^{+}

and

T e^{-}

are the positive (foreground) and negative (background) subsets, respectively, of

T e

. Meanwhile, a score function,

\hat{s} (x)

, denotes the score function of the sample pixel x. Then, the ranking accuracy is defined as:

r a n k - a c c = \frac{\sum_{x \in T e^{+}, x^{'} \in T e^{+}} δ (\hat{s} (x) > \hat{s} (x^{'})) + \frac{1}{2} δ (\hat{s} (x) = \hat{s} (x^{'}))}{P o s \cdot N e g},

(25)

where

δ (\cdot)

is the indicator function, and

P o s

and

N e g

are the number of positive and negative samples, respectively, in

T e

. For convenience, the output image data of each detection method are normalized between 0 and 1 and binarized via a threshold

ξ

, and the score function

\hat{s} (x)

is defined as:

\hat{s} (x) = \{\begin{matrix} {(I_{g} (x) - ξ)}^{2}, I_{g} (x) > ξ \\ - {(I_{g} (x) - ξ)}^{2}, I_{g} (x) < ξ . \end{matrix}

(26)

Figure 12a,b provide the ROC curve for AVC/HEVC dataset sequences with three detecting methods, and Figure 12a,b stand for CDNet dataset sequences with six detecting methods. One can see that the proposed method has a larger AUC value. In other words, the demand of detection precision for foreground/background classification is reached.

For computation time analysis, we test the performance of each detection method both on CDNet and on AVC/HEVC Sequences. As for CDNet dataset, three sequences (Canoe, Office, Highway) are selected, and each sequence has separated in batches by 20 frames. Figure 13 shows the average batch-running time for these detection methods tested on CDNet dataset. For AVC/HEVC dataset, four different resolutions of sequences are chosen, which are: Container (QCIF), Ice (CIF), Racehorse (416 × 240), KristenAndSara (1280 × 720). Figure 13a displays the time consumption data for different sequences while Figure 13b provides the time consumption data for sequences differ from data resolutions. We can find that the proposed method achieves a lower computation time and better performance on detection precision compared with other approaches. Also, the proposed method intends to find a balance between time costs and detection accuracy.

3.1.5. Performance Comparisons with Motion Vector Extraction Methods

In this subsection, we add some extra experiments that evaluate the performance compared with some other motion vector extraction methods. Based on the authors’ acknowledgements, previous works have introduced the idea of extracting motion vectors of moving objects as ROI detection methods. For example, in [69,70], the authors proposed a two-stage method to identificate moving objects. At the first stage, moving objects are detected based on the analysis of motion vectors in predicted MBs and DCT coefficients in intra-coded MBs. Then, the authors proposed motion vector informations such as spatial correlation, temporal correlation to identify moving objects. Also, they considered the solutions for objects with intra-coded MBs. Niu et al. [71] presented a novel segmentation approach to extract moving objects in the H.264 compressed domain, in which motion vectors are first refined by the motion relativity in both space and time, then clustered based on their differences with the reference of global motion vector. Similarly, work [72] used Global Motion Compensation (GMC) to effectively extract moving objects based on motion vectors. The authors proposed a search-matching algorithm to compute the background motion vectors for global motion extraction. Then, foreground moving objects are identified by excluding global motion vectors. They also introduced a statistic variable named fourth-order moment to distinguish the background signal and moving objects. Recently, Serhan et al. [73] propose a hybrid tracking method which detects moving objects in videos compressed according to H.265/HEVC standard. They introduced Markov Random Field (MRF) model to captures spatial and temporal coherence of the moving object. Also, background/foreground color modeling is proposed both for I frames and for P frames.

Figure 14 displays some visual effects comparisons among those previous works corresponding with motion vector extraction methods. we have selected five sequences for performance tests. These sequences contain both big or small objects (people) with various background changes. Three methods are introduced for comparison purposes; they are Yokoyama et al. [69], Shun et al. [72] and Serhan et al. [73]. As seen in Figure 14, the proposed method meets a satiable results that clearly reflects the details of the moving objects. For detection accuracy analysis, we plot the presion curves among the four tested methods. This results are displayed in Figure 15, one can see that the proposed method can meet the robustness demands for different test cases. Also, the detection precison is more stable than others among the selected test sequences.

4. Conclusions

In this paper, we design a new foreground ROI detection method for video sequences based on two ROI models: “inter” and “intra”. The “inter” model is in charge of acquiring the real motion blocks via global motion detection and compensation. Several detection metrics and a decision tree model are proposed in the inter model, including camera status decision, motion vector subtraction. Also, different categories of camera statuses are taken into consideration with company of global motion compensation. The “intra” model is in charge of revealing the object details by the object filtering approach, and it extracts both the edge and inner part of objects in the frame. Notice that ROI detection focuses on moving details of objects. The output of the “inter” model gives the motion information however not specific which is represented as blocks, while the “intra” model uses this motion information as the pilot and it reconstructs the output by image segmentation and combination using the proposed cluster approach together with a motion area pruning algorithm. Experimental results demonstrate that the proposed method gets a better result both on detection precision and on computational time. Future works will include machine learning approaches to detect the actions of ROI objects and improves the detection quality specifically in the action areas, and the parameters of the proposed method will be optimized which suits video with various characteristics.

Author Contributions

Conceptualization, Z.Z.; Methodology, Z.Z. and T.J.; Software, B.D.; Visualization, M.G.; Writing–review & editing, X.L.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

Faugeras, O.D. Digital color image processing within the framework of a human visual model. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 380–393. [Google Scholar] [CrossRef]
Song, H.; Kuo, C.C. A region-based H.263+ codec and its rate control for low VBR video. IEEE Trans. Multimed. 2004, 6, 489–500. [Google Scholar] [CrossRef]
Tong, L.; Rao, K. Region of interest based H.263 compatible codec and its rate control for low bit rate video conferencing. In Proceedings of the 2005 International Symposium on Intelligent Signal Processing and Communication Systems, Hong Kong, China, 13–16 December 2005; pp. 249–252. [Google Scholar] [CrossRef]
Mukherjee, D.; Bankoski, J.; Grange, A.; Han, J.; Koleszar, J.; Wilkins, P.; Xu, Y.; Bultje, R. The latest open-source video codec VP9 - An overview and preliminary results. In Proceedings of the Picture Coding Symposium (PCS), San Jose, CA, USA, 8–11 December 2013; pp. 390–393. [Google Scholar] [CrossRef]
Sullivan, G.; Ohm, J.; Han, W.J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
Wang, H.; Chang, S.F. A highly efficient system for automatic face region detection in MPEG video. IEEE Trans. Circuits Syst. Video Technol. 1997, 7, 615–628. [Google Scholar] [CrossRef]
Hartung, J.; Jacquin, A.; Pawlyk, J.; Rosenberg, J.; Okada, H.; Crouch, P. Object-oriented H.263 compatible video coding platform for conferencing applications. IEEE J. Sel. Areas Commun. 1998, 16, 42–55. [Google Scholar] [CrossRef]
Doulamis, N.; Doulamis, A.; Kalogeras, D.; Kollias, S. Low bit-rate coding of image sequences using adaptive regions of interest. IEEE Trans. Circuits Syst. Video Technol. 1998, 8, 928–934. [Google Scholar] [CrossRef]
Lam, C.F.; Lee, M. Video segmentation using color difference histogram. In Multimedia Information Analysis and Retrieval, Proceedings of the IAPR International Workshop, MINAR’ 98, Hong Kong, China, 13–14 August 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 159–174. [Google Scholar] [CrossRef]
Schafer, R. MPEG-4: A multimedia compression standard for interactive applications and services. Electron. Commun. Eng. J. 1998, 10, 253–262. [Google Scholar] [CrossRef]
Zhou, X.; Yang, C.; Zhao, H.; Yu, W. Low-Rank Modeling and Its Applications in Image Analysis. CoRR 2014, arXiv:1401.3409. [Google Scholar] [CrossRef]
Gupta, S.; Davidson, J.; Levine, S.; Sukthankar, R.; Malik, J. Cognitive Mapping and Planning for Visual Navigation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Zhang, Z.; Jing, T.; Han, J.; Xu, Y.; Zhang, F. A New Rate Control Scheme For Video Coding Based On Region Of Interest. IEEE Access 2017, 5, 13677–13688. [Google Scholar] [CrossRef]
Zhang, Z.; Jing, T.; Han, J.; Xu, Y.; Li, X. Flow-Process Foreground Region of Interest Detection Method for Video Codecs. IEEE Access 2017, 5, 16263–16276. [Google Scholar] [CrossRef]
Zhang, Z.; Jing, T.; Han, J.; Xu, Y.; Li, X.; Gao, M. ROI-Based Video Transmission in Heterogeneous Wireless Networks With Multi-Homed Terminals. IEEE Access 2017, 5, 26328–26339. [Google Scholar] [CrossRef]
Gonzalez, D.; Botella, G.; Meyer-Baese, U.; Garcia, C.; Sanz, C.; Prieto-Matas, M.; Tirado, F. A low cost matching motion estimation sensor based on the NIOS II microprocessor. Sensors 2012, 12, 13126–13149. [Google Scholar] [CrossRef] [PubMed]
Gonzalez, D.; Botella, G.; Garcia, C.; Prieto, M.; Tirado, F. Acceleration of block-matching algorithms using a custom instruction-based paradigm on a Nios II microprocessor. EURASIP J. Adv. Signal Process. 2013, 2013, 118. [Google Scholar] [CrossRef] [Green Version]
Nunez-Yanez, J.L.; Nabina, A.; Hung, E.; Vafiadis, G. Cogeneration of Fast Motion Estimation Processors and Algorithms for Advanced Video Coding. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2012, 20, 437–448. [Google Scholar] [CrossRef]
Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; Volume 2, p. 252. [Google Scholar] [CrossRef]
Stauffer, C.; Grimson, W.E.L. Learning Patterns of Activity Using Real-Time Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 747–757. [Google Scholar] [CrossRef]
Zivkovic, Z. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, 26–26 August 2004; Volume 2, pp. 28–31. [Google Scholar] [CrossRef]
Cand, E.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis?: Recovering low-rank matrices from sparse errors. In Proceedings of the 2010 IEEE Sensor Array and Multichannel Signal Processing Workshop, Jerusalem, Israel, 4–7 October 2010; pp. 201–204. [Google Scholar] [CrossRef]
Gao, Z.; Cheong, L.F.; Wang, Y.X. Block-Sparse RPCA for Salient Motion Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1975–1987. [Google Scholar] [CrossRef] [PubMed]
Guyon, C.; Bouwmans, T.; Zahzah, E.H. Foreground detection based on low-rank and block-sparse matrix decomposition. In Proceedings of the 2012 19th IEEE International Conference on Image Processing (ICIP), Orlando, FL, USA, 30 September–3 October 2012; pp. 1225–1228. [Google Scholar] [CrossRef]
Xu, H.; Caramanis, C.; Sanghavi, S. Robust PCA via Outlier Pursuit. IEEE Trans. Inf. Theory 2012, 58, 3047–3064. [Google Scholar] [CrossRef] [Green Version]
Derin Babacan, S.; Luessi, M.; Molina, R.; Katsaggelos, A.K. Sparse Bayesian Methods for Low-Rank Matrix Estimation. IEEE Trans. Signal Process. 2012, 60, 3964–3977. [Google Scholar] [CrossRef]
Balzano, L.; Nowak, R.; Recht, B. Online identification and tracking of subspaces from highly incomplete information. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, VA, USA, 29 September–1 October 2010; pp. 704–711. [Google Scholar] [CrossRef]
Peng, Y.; Ganesh, A.; Wright, J.; Xu, W.; Ma, Y. RASL: Robust Alignment by Sparse and Low-Rank Decomposition for Linearly Correlated Images. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2233–2246. [Google Scholar] [CrossRef]
Zhou, T.; Tao, D. GoDec: Randomized Low-rank and Sparse Matrix Decomposition in Noisy Case. In Proceedings of the 28th ICML, Bellevue, WA, USA, 28 June–2 July 2011; pp. 33–40. [Google Scholar]
Chan, A.B.; Vasconcelos, N. Layered Dynamic Textures. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1862–1879. [Google Scholar] [CrossRef]
Chan, A.B.; Vasconcelos, N. Modeling, Clustering, and Segmenting Video with Mixtures of Dynamic Textures. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 909–926. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Mumtaz, A.; Coviello, E.; Lanckriet, G.R.G.; Chan, A.B. Clustering Dynamic Textures with the Hierarchical EM Algorithm for Modeling Video. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1606–1621. [Google Scholar] [CrossRef]
Chan, A.B.; Mahadevan, V.; Vasconcelos, N. Generalized Stauffer–Grimson background subtraction for dynamic scenes. Mach. Vis. Appl. 2011, 22, 751–766. [Google Scholar] [CrossRef]
Barnich, O.; Droogenbroeck, M.V. ViBe: A Universal Background Subtraction Algorithm for Video Sequences. IEEE Trans. Image Process. 2011, 20, 1709–1724. [Google Scholar] [CrossRef] [PubMed]
Yao, J.; Odobez, J.M. Multi-Layer Background Subtraction Based on Color and Texture. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar] [CrossRef]
Maddalena, L.; Petrosino, A. A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications. IEEE Trans. Image Process. 2008, 17, 1168–1177. [Google Scholar] [CrossRef] [PubMed]
Panda, D.K.; Meher, S. Detection of Moving Objects Using Fuzzy Color Difference Histogram Based Background Subtraction. IEEE Signal Process. Lett. 2016, 23, 45–49. [Google Scholar] [CrossRef]
Chiranjeevi, P.; Sengupta, S. Detection of Moving Objects Using Multi-channel Kernel Fuzzy Correlogram Based Background Subtraction. IEEE Trans. Cybern. 2014, 44, 870–881. [Google Scholar] [CrossRef]
Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239. [Google Scholar] [CrossRef]
Kolmogorov, V.; Zabin, R. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 147–159. [Google Scholar] [CrossRef] [Green Version]
Boykov, Y.; Kolmogorov, V. An experimental comparison of min-cut/max- flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1124–1137. [Google Scholar] [CrossRef] [Green Version]
Tang, M.; Gorelick, L.; Veksler, O.; Boykov, Y. GrabCut in One Cut. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1769–1776. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 346–361. [Google Scholar]
Szegedy, C.; Toshev, A.; Erhan, D. Deep Neural Networks for Object Detection. Adv. Neural Inf. Process. Syst. 2013, 26, 2553–2561. [Google Scholar]
Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable Object Detection Using Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 2155–2162. [Google Scholar]
Szegedy, C.; Reed, S.E.; Erhan, D.; Anguelov, D. Scalable, High-Quality Object Detection. CoRR 2014, arXiv:1412.1441. [Google Scholar]
Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. CoRR 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. CoRR 2015, arXiv:1512.02325. [Google Scholar]
Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. CoRR 2016, arXiv:1608.06993. [Google Scholar]
Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. CoRR 2016, arXiv:1605.06409. [Google Scholar]
Pinheiro, P.H.O.; Collobert, R.; Dollár, P. Learning to Segment Object Candidates. CoRR 2015, arXiv:1506.06204. [Google Scholar]
Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR 2013, arXiv:1311.2524. [Google Scholar]
Yuan, J.; Wu, Y. Mining visual collocation patterns via self-supervised subspace learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2012, 42, 334–346. [Google Scholar] [CrossRef] [PubMed]
Zhang, Z.; Jing, T.; Tian, C.; Cui, P.; Li, X.; Gao, M. Objects Discovery Based on Co-Occurrence Word Model with Anchor-Box Polishing. IEEE Trans. Circuits Syst. Video Technol. 2019. [Google Scholar] [CrossRef]
Chen, Y.M.; Bajic, I.V. A Joint Approach to Global Motion Estimation and Motion Segmentation From a Coarsely Sampled Motion Vector Field. IEEE Trans. Circuits Syst. Video Technol. 2011, 21, 1316–1328. [Google Scholar] [CrossRef]
Unger, M.; Asbach, M.; Hosten, P. Enhanced background subtraction using global motion compensation and mosaicing. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 2708–2711. [Google Scholar] [CrossRef]
Jin, Y.; Tao, L.; Di, H.; Rao, N.I.; Xu, G. Background modeling from a free-moving camera by Multi-Layer Homography Algorithm. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 1572–1575. [Google Scholar] [CrossRef]
Guerreiro, R.F.C.; Aguiar, P.M.Q. Global Motion Estimation: Feature-Based, Featureless, or Both?! In Image Analysis and Recognition, Proceedings of the Third International Conference, ICIAR 2006, voa de Varzim, Portugal, 18–20 September 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 721–730. [Google Scholar]
The Change Detection Dataset. Available online: http://changedetection.net (accessed on 25 March 2019).
The HEVC/AVC Official Test Sequences, Online. Available online: ftp://hevc@ftp.tnt.uni-hannover.de/testsequences (accessed on 25 March 2019).
Su, Y.; Sun, M.T.; Hsu, V. Global motion estimation from coarsely sampled motion vector field and the applications. IEEE Trans. Circuits Syst. Video Technol. 2005, 15, 232–242. [Google Scholar] [CrossRef]
Smolic, A.; Hoeynck, M.; Ohm, J.R. Low-complexity global motion estimation from P-frame motion vectors for MPEG-7 applications. In Proceedings of the 2000 International Conference on Image Processing (Cat. No.00CH37101), Vancouver, BC, Canada, 10–13 September 2000; Volume 2, pp. 271–274. [Google Scholar] [CrossRef]
Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1987; pp. 726–740. [Google Scholar] [CrossRef]
Jarvis, R. On the identification of the convex hull of a finite set of points in the plane. Inf. Process. Lett. 1973, 2, 18–21. [Google Scholar] [CrossRef]
Zhou, X.; Yang, C.; Yu, W. Moving Object Detection by Detecting Contiguous Outliers in the Low-Rank Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 597–610. [Google Scholar] [CrossRef] [PubMed]
Rueda, L.; Mery, D.; Kittler, J. Σ-Δ Background Subtraction and the Zipf Law. In Progress in Pattern Recognition, Image Analysis and Applications, Proceedings of the 12th Iberoamericann Congress on Pattern Recognition, CIARP 2007, Valparaiso, Chile, 13–16 November 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 42–51. [Google Scholar] [CrossRef]
Yokoyama, T.; Iwasaki, T.; Watanabe, T. Motion Vector Based Moving Object Detection and Tracking in the MPEG Compressed Domain. In Proceedings of the 2009 Seventh International Workshop on Content-Based Multimedia Indexing, Chania, Greece, 3–5 June 2009; pp. 201–206. [Google Scholar] [CrossRef]
Yoneyama, A.; Nakajima, Y.; Yanagihara, H.; Sugano, M. Moving object detection and identification from MPEG coded data. In Proceedings of the 1999 International Conference on Image Processing (Cat. 99CH36348), Kobe, Japan, 24–28 October 1999; Volume 2, pp. 934–938. [Google Scholar] [CrossRef]
Niu, C.; Liu, Y. Moving Object Segmentation Based on Video Coding Information in H.264 Compressed Domain. In Proceedings of the 2009 2nd International Congress on Image and Signal Processing, Tianjin, China, 17–19 October 2009; pp. 1–5. [Google Scholar] [CrossRef]
Zhang, S.; Su, X.; Xie, L. Global motion compensation for image sequences and motion object detection. In Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), Taiyuan, China, 22–24 October 2010; Volume 1, pp. V1–406–V1–409. [Google Scholar] [CrossRef]
Gul, S.; Meyer, J.T.; Hellge, C.; Schierl, T.; Samek, W. Hybrid video object tracking in H.265/HEVC video streams. In Proceedings of the 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP), Montreal, QC, Canada, 21–23 September 2016; pp. 1–5. [Google Scholar] [CrossRef]

Figure 1. Basic structure of ROI Detection model.

Figure 2. An illustration of sampling areas.

Figure 3. An illustration of the camera zoom in and out status.

Figure 4. Motion vector decompostion on horizontal and vertical directions.

Figure 5. A decision tree model for global motion detection.

Figure 6. Visual comparisons of foreground motion blocks extraction with global motion among different approaches. (a) Sequences (From the top to bottom): Coastguards, Flower, Stefan. (b) Motion blocks distribution. (c) Our method. (d) The GD approach [63]. (e) The LSS-ME approach [64]. (f) The RANSAC approach [65].

Figure 7. The motion segmentation results for different values of

ε

(

K_{m a x} = 4

): (a) Raw sequences. (From top to bottom: Suzie, Ice, Foreman) (b) Foreground motion blocks.

Figure 7. The motion segmentation results for different values of

ε

(

K_{m a x} = 4

): (a) Raw sequences. (From top to bottom: Suzie, Ice, Foreman) (b) Foreground motion blocks.

Figure 8. Visual effects before and after the pruning algorithm for motion areas: (a) Raw sequences. (From top to bottom: Suzie, Ice, Foreman) (b) Foreground motion blocks. (c) Before pruning. (d) After pruning.

Figure 9. Visual effects of mask drawing and mask inflation: (a) Raw sequences. (From top to bottom: CoastGuard, Suzie. (b)“edge” feature blocks. (c) Mask shape with

ι = 1.4

. (d) Mask shape with

ι = 1

.

Figure 9. Visual effects of mask drawing and mask inflation: (a) Raw sequences. (From top to bottom: CoastGuard, Suzie. (b)“edge” feature blocks. (c) Mask shape with

ι = 1.4

. (d) Mask shape with

ι = 1

.

Figure 10. Visual effect comparisons through different methods on CDNet. (a) Source frames. (b) Ground truth. (c) The proposed method. (d) VBRPCA [26]. (e) GROUSE [27]. (f) MKFC [38]. (g) PCP [22]. (h) ViBe [34]. (i) GMM [21]. (j) Decolor [67].

Figure 11. Visual effect comparisons through different methods on HEVC/AVC test sequences. (a) Original frames. (b) Ground truth. (c) Motion vector mask. (d) The proposed method. (e) SSGodec [29]. (f) Sigma-Delta [68].

Figure 12. ROC curves for different test sequences based on several detection methods. (a) BQSquare; (b) Stefan; (c) PeopleShade; (d) Highway.

Figure 13. Running time evaluation for several ROI detection approaches. (a) Time consumption varies from different sequences. (b) Time consumption varies from different resolution.

Figure 14. Visual effect comparisons among various motion vector extraction method (a) Original sequences (from top to bottom: Pedestrians #3, Pedestrians #15, Johnny #31, Office #18, Akiyo #10, Party #27 ). (b) Yokoyama et al. [69]. (c) Shun et al. [72]. (d) Serhan et al. [73]. (e) Our method.

Figure 15. Detection precision curves for test sequences among different motion vector extraction methods. (a) Akiyo (250 frames); (b) Carphone (300 frames); (c) Pedestrians (250 frames); (d) Salesman (400 frames).

Table 1. Running time tests for GME approaches. (Platform: OpenCV 3.1, Core-i7 CPU 2.8GHZ × 4).

Sequences	Our Method	GD	LSS-ME	RANSAC
Flower	0.097 s	0.181 s	0.466 s	0.240 s
Stefan	0.293 s	0.665 s	0.712 s	0.497 s
CoastGuard	0.225 s	0.604 s	0.811 s	0.472 s

Table 2. Performance of camera status decision with different settings of thresholds.

Seq	Status	$th 1$	$th 2$	prec	Recall	F-Score
BQTerrace Flower Foreman	panning /rests	0.1	0.3	0.818	0.901	0.857
		0.3	0.3	0.940	0.901	0.922
		0.5	0.3	0.846	0.880	0.863
		0.7	0.3	0.702	0.801	0.747
PartyScene BQSquare Mobile	zooming /rests	0.3	0.1	0.649	0.740	0.692
		0.3	0.3	0.722	0.780	0.750
		0.3	0.5	0.773	0.820	0.796
		0.3	0.7	0.693	0.680	0.686

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, Z.; Jing, T.; Ding, B.; Gao, M.; Li, X. A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs. Appl. Sci. 2019, 9, 2670. https://doi.org/10.3390/app9132670

AMA Style

Zhang Z, Jing T, Ding B, Gao M, Li X. A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs. Applied Sciences. 2019; 9(13):2670. https://doi.org/10.3390/app9132670

Chicago/Turabian Style

Zhang, Zhewei, Tao Jing, Bowen Ding, Meilin Gao, and Xuejing Li. 2019. "A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs" Applied Sciences 9, no. 13: 2670. https://doi.org/10.3390/app9132670

APA Style

Zhang, Z., Jing, T., Ding, B., Gao, M., & Li, X. (2019). A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs. Applied Sciences, 9(13), 2670. https://doi.org/10.3390/app9132670

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Model-Based Approach of Foreground Region of Interest Detection for Video Codecs

Abstract

1. Introduction

1.1. Related Work

1.2. Overview and Contributions

2. Detection Model

2.1. Foreground Motion Blocks Extraction

2.1.1. Sampling Areas Determination

2.1.2. Camera Status Detection

2.1.3. Global Motion Compensation

2.1.4. Discussion

2.1.5. Marking Motion Areas

2.1.6. Mask Drawing and ROI Extracting

3. Experiments

3.1. Visual Effect Analysis Experiments

3.1.1. Parameters Settings

3.1.2. Visual Comparisons on CDNet Dataset

3.1.3. Visual Comparisons On AVC/HEVC Sequences

3.1.4. Miscellaneous Analysis

3.1.5. Performance Comparisons with Motion Vector Extraction Methods

4. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI