<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sensors</journal-id>
<journal-title>Sensors</journal-title>
<issn pub-type="epub">1424-8220</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/s121115376</article-id>
<article-id pub-id-type="publisher-id">sensors-12-15376</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>GrabCut-Based Human Segmentation in Video Sequences</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Hernández-Vela</surname><given-names>Antonio</given-names></name><xref ref-type="aff" rid="af1-sensors-12-15376"><sup>1</sup></xref><xref ref-type="aff" rid="af2-sensors-12-15376"><sup>2</sup></xref><xref ref-type="corresp" rid="c1-sensors-12-15376"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Reyes</surname><given-names>Miguel</given-names></name><xref ref-type="aff" rid="af1-sensors-12-15376"><sup>1</sup></xref><xref ref-type="aff" rid="af2-sensors-12-15376"><sup>2</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Ponce</surname><given-names>Víctor</given-names></name><xref ref-type="aff" rid="af1-sensors-12-15376"><sup>1</sup></xref><xref ref-type="aff" rid="af2-sensors-12-15376"><sup>2</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Escalera</surname><given-names>Sergio</given-names></name><xref ref-type="aff" rid="af1-sensors-12-15376"><sup>1</sup></xref><xref ref-type="aff" rid="af2-sensors-12-15376"><sup>2</sup></xref></contrib></contrib-group>
<aff id="af1-sensors-12-15376">
<label>1</label> Departamento MAIA, Universitat de Barcelona, Gran Via 585, 08007 Barcelona, Spain; E-Mails: <email>mreyes@cvc.uab.cat</email> (M.R.); <email>vponce@cvc.uab.cat</email> (V.P.); <email>sergio@maia.ub.es</email> (S.E.)</aff>
<aff id="af2-sensors-12-15376">
<label>2</label> Centre de Visió per Computador, Campus UAB, Edifici O, 08193 Bellaterra, Barcelona, Spain</aff>
<author-notes>
<corresp id="c1-sensors-12-15376">
<label>*</label> Author to whom correspondence should be addressed; E-Mail: <email>ahernandez@cvc.uab.cat</email>; Tel.: +34-93-402-1897; Fax: +34-93-402-1601.</corresp></author-notes>
<pub-date pub-type="collection">
<year>2012</year></pub-date>
<pub-date pub-type="epub">
<day>09</day>
<month>11</month>
<year>2012</year></pub-date>
<volume>12</volume>
<issue>11</issue>
<fpage>15376</fpage>
<lpage>15393</lpage>
<history>
<date date-type="received">
<day>04</day>
<month>09</month>
<year>2012</year></date>
<date date-type="rev-recd">
<day>01</day>
<month>11</month>
<year>2012</year></date>
<date date-type="accepted">
<day>06</day>
<month>11</month>
<year>2012</year></date></history>
<permissions>
<copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2012</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>In this paper, we present a fully automatic Spatio-Temporal GrabCut human segmentation methodology that combines tracking and segmentation. GrabCut initialization is performed by HOG-based subject detection, face detection, and a skin color model. Spatial information is included by Mean Shift clustering, whereas temporal coherence is considered through a history of Gaussian Mixture Models. Moreover, full face and pose recovery is obtained by combining human segmentation with Active Appearance Models and Conditional Random Fields. Results on public datasets and on a new Human Limb dataset show robust segmentation and recovery of both face and pose using the presented methodology.</p></abstract>
<kwd-group>
<kwd>segmentation</kwd>
<kwd>human pose recovery</kwd>
<kwd>GrabCut</kwd>
<kwd>GraphCut</kwd>
<kwd>Active Appearance Models</kwd>
<kwd>Conditional Random Field</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Human segmentation in uncontrolled environments is a hard task because of the constant changes produced in natural scenes: illumination changes, moving objects, changes in the point of view, and occlusions, to mention just a few. Because of the nature of the problem, a common way to proceed is to discard most of the image so that the analysis can be performed on a reduced set of small candidate regions. In [<xref ref-type="bibr" rid="b1-sensors-12-15376">1</xref>], the authors propose a full-body detector based on a cascade of classifiers [<xref ref-type="bibr" rid="b2-sensors-12-15376">2</xref>] using HOG features. This methodology is currently used in several works related to the pedestrian detection problem [<xref ref-type="bibr" rid="b3-sensors-12-15376">3</xref>]. GrabCut [<xref ref-type="bibr" rid="b4-sensors-12-15376">4</xref>] has also shown high robustness in Computer Vision segmentation problems, defining the pixels of the image as nodes of a graph and extracting foreground pixels via iterated Graph Cut optimization. This methodology has been applied to the problem of human body segmentation with high success [<xref ref-type="bibr" rid="b5-sensors-12-15376">5</xref>,<xref ref-type="bibr" rid="b6-sensors-12-15376">6</xref>]. When working with image sequences, temporal coherence can also be incorporated into this optimization problem. In [<xref ref-type="bibr" rid="b7-sensors-12-15376">7</xref>], the authors extended the Gaussian Mixture Model (GMM) of the GrabCut algorithm so that the color space is complemented with the time derivative of pixel intensities in order to include temporal information in the segmentation optimization process. However, the main problem of that method is that moving pixels correspond to the boundaries between foreground and background regions, and thus there is no clear discrimination.</p>
<p>Once a region of interest is determined, pose is often recovered by determining the body limbs together with their spatial coherence (and also their temporal coherence in the case of image sequences). Most of these approaches are probabilistic, and features are usually based on edges or “appearance”. In [<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>], the author proposes a probabilistic approach for limb detection based on edge learning complemented with color information. The image of probabilities is then formulated in a Conditional Random Field (CRF) scheme and optimized using belief propagation. This work has obtained robust results and has been extended by other authors to include local GrabCut segmentation and temporal refinement of the CRF model [<xref ref-type="bibr" rid="b5-sensors-12-15376">5</xref>,<xref ref-type="bibr" rid="b6-sensors-12-15376">6</xref>].</p>
<p>In this paper, we propose a fully automatic Spatio-Temporal GrabCut human segmentation methodology, which benefits from the combination of tracking and segmentation. First, subjects are detected by means of a HOG-based cascade of classifiers. Face detection and a skin color model are used to define a set of seeds that initialize the GrabCut algorithm. Spatial information is taken into account by means of Mean Shift clustering, whereas temporal information is considered through the membership probability of each pixel with respect to a history of Gaussian Mixture Models. Moreover, the methodology is combined with Shape and Active Appearance Models (AAM) to define three different meshes of the face, one for near-frontal views and the other two for near-lateral views. Temporal coherence and fitting cost are considered in conjunction with GrabCut segmentation to allow a smooth and robust face fitting in video sequences. Finally, limb detection and a CRF model are applied to the obtained segmentation, showing high robustness in capturing body limbs thanks to the accurate human segmentation. The main limitation of our approach is that it depends on a correct detection of the person and his/her face in order to obtain the desired result. To test the proposed methodology, we use public datasets and present a new Human Limb dataset useful for human segmentation, limb detection, and pose recovery purposes.</p>
<p>The rest of the paper is organized as follows: Section 2 describes the proposed methodology, presenting the spatio-temporal GrabCut segmentation, the AAM for face fitting, and the pose recovery methodology. Experimental results on public and novel datasets are presented in Section 3. Finally, Section 4 concludes the paper.</p></sec>
<sec>
<label>2.</label>
<title>Full-Body Pose Recovery</title>
<p>In this section, we present the Spatio-Temporal GrabCut methodology to deal with the problem of automatic human segmentation in video sequences. Then, we describe the Active Appearance Models used to recover the face, and the body pose recovery methodology based on the approach of [<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>]. All methods presented in this section are combined to improve final segmentation and pose recovery. <xref ref-type="fig" rid="f1-sensors-12-15376">Figure 1</xref> illustrates the different modules of the project.</p>
<sec>
<label>2.1.</label>
<title>GrabCut Segmentation</title>
<p>In [<xref ref-type="bibr" rid="b4-sensors-12-15376">4</xref>], the authors proposed an approach to find a binary segmentation (background and foreground) of an image by formulating an energy minimization scheme such as the one presented in [<xref ref-type="bibr" rid="b9-sensors-12-15376">9</xref>–<xref ref-type="bibr" rid="b11-sensors-12-15376">11</xref>], extended to use color instead of just gray-scale information. Given a color image <italic>I</italic>, let us consider the array <italic>z</italic> = (<italic>z</italic><sub>1</sub>, …, <italic>z<sub>i</sub></italic>, …, <italic>z<sub>N</sub></italic>) of <italic>N</italic> pixels where <italic>z<sub>i</sub></italic> = (<italic>R<sub>i</sub></italic>, <italic>G<sub>i</sub></italic>, <italic>B<sub>i</sub></italic>), <italic>i</italic> ∈ [1, …, <italic>N</italic>] in RGB space. The segmentation is defined as the array <bold><italic>α</italic></bold> = (<italic>α</italic><sub>1</sub>, …, <italic>α<sub>N</sub></italic>), <italic>α<sub>i</sub></italic> ∈ {0, 1}, assigning a label to each pixel of the image indicating whether it belongs to background or foreground. A trimap <italic>T</italic> is defined by the user in a semi-automatic way and consists of three regions: <italic>T<sub>B</sub></italic>, <italic>T<sub>F</sub></italic> and <italic>T<sub>U</sub></italic>, containing the initial background, foreground, and uncertain pixels, respectively. Pixels belonging to <italic>T<sub>B</sub></italic> and <italic>T<sub>F</sub></italic> are clamped as background and foreground, respectively, which means GrabCut cannot modify these labels, whereas those belonging to <italic>T<sub>U</sub></italic> are the ones the algorithm is able to label. Color information is introduced by GMMs. A full covariance GMM of <italic>K</italic> components is defined for background pixels (<italic>α<sub>i</sub></italic> = 0), and another one for foreground pixels (<italic>α<sub>i</sub></italic> = 1), parametrized as follows
<disp-formula id="FD1">
<label>(1)</label>
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:mi mathvariant="bold-italic">θ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>μ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>∑</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mo>{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>}</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>‥</mml:mo>
<mml:mi>K</mml:mi>
<mml:mo>}</mml:mo>
<mml:mo>,</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>π</italic> are the weights, <italic>μ</italic> the means, and Σ the covariance matrices of the model. We also consider the array <bold>k</bold> = {<italic>k</italic><sub>1</sub>, …, <italic>k<sub>i</sub></italic>, …, <italic>k<sub>N</sub></italic>}, <italic>k<sub>i</sub></italic> ∈ {1, …, <italic>K</italic>}, <italic>i</italic> ∈ [1, …, <italic>N</italic>], indicating the component of the background or foreground GMM (according to <italic>α<sub>i</sub></italic>) to which the pixel <italic>z<sub>i</sub></italic> belongs. The energy function for segmentation is then
<disp-formula id="FD2">
<label>(2)</label>
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>θ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">U</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>θ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="bold">V</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>where <bold>U</bold> is the likelihood potential, based on the probability distributions <italic>p</italic>(·) of the GMM:
<disp-formula id="FD3">
<label>(3)</label>
<mml:math id="mm3" display="block">
<mml:semantics id="sm3">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold-italic">α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">θ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo mathvariant="bold">∑</mml:mo>
<mml:mi>i</mml:mi></mml:munder>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mo>log</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">θ</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>log</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>and <bold>V</bold> is a regularizing prior assuming that segmented regions should be coherent in terms of color, taking into account a neighborhood <italic>C</italic> around each pixel
<disp-formula id="FD4">
<label>(4)</label>
<mml:math id="mm4" display="block">
<mml:semantics id="sm4">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold-italic">α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>γ</mml:mi>
<mml:munder>
<mml:mo mathvariant="bold">∑</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>}</mml:mo>
<mml:mo>∈</mml:mo>
<mml:mi>C</mml:mi></mml:mrow></mml:munder>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>n</mml:mi></mml:msub>
<mml:mo>≠</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>m</mml:mi></mml:msub>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>exp</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi>β</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>‖</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>m</mml:mi></mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>n</mml:mi></mml:msub></mml:mrow>
<mml:mo>‖</mml:mo></mml:mrow></mml:mrow>
<mml:mn>2</mml:mn></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula></p>
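<p>As an illustration only, the pairwise term of Equation (4) can be sketched in Python as follows. The 4-connected neighborhood used for <italic>C</italic>, the function name, and the default values of <italic>γ</italic> and <italic>β</italic> are assumptions of this sketch and are not settings reported in this paper.</p>
<preformat><![CDATA[
```python
import math

def smoothness_term(alpha, z, gamma=50.0, beta=0.01):
    """Pairwise term V of Equation (4) on a 4-connected grid.

    alpha: 2D list of labels in {0, 1}; z: 2D list of (R, G, B) tuples.
    gamma and beta are illustrative defaults, not values from the paper.
    """
    rows, cols = len(alpha), len(alpha[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Visit each unordered pair {m, n} once: right and down neighbors.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr >= rows or cc >= cols:
                    continue
                # Indicator [alpha_n != alpha_m]: only label changes are penalized.
                if alpha[r][c] != alpha[rr][cc]:
                    dist2 = sum((a - b) ** 2 for a, b in zip(z[r][c], z[rr][cc]))
                    total += math.exp(-beta * dist2)
    return gamma * total
```
]]></preformat>
<p>In this sketch, a label change between two pixels of identical color costs the full <italic>γ</italic>, whereas a change across a strong color edge costs almost nothing, which is why the minimum cut tends to follow image edges.</p>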
<p>With this energy minimization scheme and given the initial trimap <italic>T</italic>, the final segmentation is performed using a minimum cut algorithm [<xref ref-type="bibr" rid="b9-sensors-12-15376">9</xref>,<xref ref-type="bibr" rid="b10-sensors-12-15376">10</xref>,<xref ref-type="bibr" rid="b12-sensors-12-15376">12</xref>]. The classical semi-automatic GrabCut algorithm is summarized in Algorithm 1.
<array>
<tbody>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Algorithm 1 Original GrabCut algorithm.</bold></td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td align="right" valign="top"> 1:</td>
<td align="left" valign="top">Trimap <italic>T</italic> initialization with manual annotation.</td></tr>
<tr>
<td align="right" valign="top"> 2:</td>
<td align="left" valign="top">Initialize <italic>α<sub>i</sub></italic> = 0 for <italic>i</italic> ∈ <italic>T<sub>B</sub></italic> and <italic>α<sub>i</sub></italic> = 1 for <italic>i</italic> ∈ <italic>T<sub>U</sub></italic> ∪ <italic>T<sub>F</sub></italic>.</td></tr>
<tr>
<td align="right" valign="top"> 3:</td>
<td align="left" valign="top">Initialize Background and Foreground GMMs from sets <italic>α<sub>i</sub></italic> = 0 and <italic>α<sub>i</sub></italic> = 1 respectively, with <italic>k</italic>-means.</td></tr>
<tr>
<td align="right" valign="top"> 4:</td>
<td align="left" valign="top">Assign GMM components to pixels.</td></tr>
<tr>
<td align="right" valign="top"> 5:</td>
<td align="left" valign="top">Learn GMM parameters from data z.</td></tr>
<tr>
<td align="right" valign="top"> 6:</td>
<td align="left" valign="top">Estimate segmentation: Graph-cuts.</td></tr>
<tr>
<td align="right" valign="top"> 7:</td>
<td align="left" valign="top">Repeat from step 4, until convergence.</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr></tbody></array></p></sec>
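<p>The alternation in steps 4 to 7 of Algorithm 1 can be illustrated with a deliberately simplified sketch. It is not the GrabCut algorithm itself: we assume gray-scale pixels, a single Gaussian per label instead of a <italic>K</italic>-component GMM, and <italic>γ</italic> = 0, so that the minimum cut reduces to an independent choice for each pixel. All function names are hypothetical.</p>
<preformat><![CDATA[
```python
import math

def fit_gaussian(values):
    """Return (mean, variance) of scalar samples; the variance is floored."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-6)

def neg_log_likelihood(value, model):
    """Negative log-density of a 1D Gaussian, playing the role of U."""
    mean, var = model
    return 0.5 * math.log(2 * math.pi * var) + (value - mean) ** 2 / (2 * var)

def simplified_grabcut(z, trimap, max_iters=10):
    """z: gray values; trimap: 'B', 'F' or 'U' per pixel."""
    if 'B' not in trimap or 'F' not in trimap:
        raise ValueError("this sketch needs background and foreground seeds")
    # Step 2: alpha = 0 on T_B, alpha = 1 on T_U and T_F.
    alpha = [0 if t == 'B' else 1 for t in trimap]
    for _ in range(max_iters):
        # Steps 3 to 5, collapsed: refit one Gaussian per label.
        bg = fit_gaussian([v for v, a in zip(z, alpha) if a == 0])
        fg = fit_gaussian([v for v, a in zip(z, alpha) if a == 1])
        # Step 6 with gamma = 0: every pixel of T_U takes the cheaper label,
        # while clamped pixels in T_B and T_F keep theirs.
        new_alpha = list(alpha)
        for i, t in enumerate(trimap):
            if t == 'U':
                cheaper_fg = neg_log_likelihood(z[i], fg) < neg_log_likelihood(z[i], bg)
                new_alpha[i] = 1 if cheaper_fg else 0
        if new_alpha == alpha:  # Step 7: stop once the labels are stable.
            break
        alpha = new_alpha
    return alpha
```
]]></preformat>
<p>Even in this reduced form, the sketch shows why the iteration helps: after the first relabeling, the color models are refitted on cleaner pixel sets, which sharpens the next labeling.</p>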
<sec>
<label>2.2.</label>
<title>Automatic Initialization</title>
<p>Our proposal builds on the previous GrabCut framework: it focuses on human body segmentation, is fully automatic, and extends the framework by taking temporal coherence into account. We refer to each frame of the video as <italic>f<sub>t</sub></italic>, <italic>t</italic> ∈ {1, …, <italic>M</italic>}, where <italic>M</italic> is the length of the sequence. Given a frame <italic>f<sub>t</sub></italic>, we first apply a person detector based on a cascade of classifiers using HOG features [<xref ref-type="bibr" rid="b1-sensors-12-15376">1</xref>]. Then, we initialize the trimap <italic>T</italic> from the bounding box <italic>B</italic> returned by the detector: <italic>T<sub>U</sub></italic> = {<italic>z<sub>i</sub></italic> ∈ <italic>B</italic>}, <italic>T<sub>B</sub></italic> = {<italic>z<sub>i</sub></italic> ∉ <italic>B</italic>}. Furthermore, in order to increase the accuracy of the segmentation algorithm, we include foreground seeds exploiting spatial and appearance prior information. On the one hand, we define a small central rectangular region <italic>R</italic> inside <italic>B</italic>, proportional to <italic>B</italic>, in such a way that we are sure it corresponds to the person. Thus, pixels inside <italic>R</italic> are set to foreground. On the other hand, we apply a face detector based on a cascade of classifiers using Haar-like features [<xref ref-type="bibr" rid="b2-sensors-12-15376">2</xref>] over <italic>B</italic>, and learn a skin color model <italic>h<sub>skin</sub></italic> consisting of a histogram over the <italic>Hue</italic> channel of the <italic>HSV</italic> image representation. All pixels inside <italic>B</italic> fitting in <italic>h<sub>skin</sub></italic> are also set to foreground. 
Therefore, we initialize <italic>T<sub>F</sub></italic> = {<italic>z<sub>i</sub></italic> ∈ <italic>R</italic>} ∪ {<italic>z<sub>i</sub></italic> ∈ <italic>δ</italic>(<italic>z<sub>i</sub></italic>, <italic>h<sub>skin</sub></italic>)}, where <italic>δ</italic> returns the set of pixels belonging to the color model defined by <italic>h<sub>skin</sub></italic>. An example of seed initialization is shown in <xref ref-type="fig" rid="f2-sensors-12-15376">Figure 2(b)</xref>.</p></sec>
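<p>The automatic trimap initialization described in Section 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the bounding box format (<italic>x</italic>, <italic>y</italic>, width, height), the parameter <italic>core_ratio</italic> that fixes the size of <italic>R</italic> relative to <italic>B</italic>, and the function name are hypothetical, and the skin pixels are assumed to have been selected beforehand with <italic>h<sub>skin</sub></italic>.</p>
<preformat><![CDATA[
```python
def init_trimap(height, width, box, skin_pixels=(), core_ratio=0.3):
    """Build a trimap of 'B', 'U', 'F' labels from a detector bounding box.

    box: (x, y, w, h), the assumed output format of a person detector.
    skin_pixels: iterable of (row, col) positions matching the skin model.
    core_ratio: assumed fraction of the box used for the central region R.
    """
    x, y, w, h = box
    # T_B: every pixel outside the bounding box B.
    trimap = [['B'] * width for _ in range(height)]
    # T_U: every pixel inside B (clipped to the image).
    for r in range(max(y, 0), min(y + h, height)):
        for c in range(max(x, 0), min(x + w, width)):
            trimap[r][c] = 'U'
    # Central rectangle R, proportional to B and centered in it.
    rw, rh = max(1, int(w * core_ratio)), max(1, int(h * core_ratio))
    rx, ry = x + (w - rw) // 2, y + (h - rh) // 2
    for r in range(max(ry, 0), min(ry + rh, height)):
        for c in range(max(rx, 0), min(rx + rw, width)):
            trimap[r][c] = 'F'
    # Skin-colored pixels inside B are also foreground seeds.
    for r, c in skin_pixels:
        inside_box = y <= r < y + h and x <= c < x + w
        inside_image = 0 <= r < height and 0 <= c < width
        if inside_box and inside_image:
            trimap[r][c] = 'F'
    return trimap
```
]]></preformat>
<p>Skin-colored positions outside <italic>B</italic> are ignored in this sketch, in line with the text, which restricts the skin seeds to pixels inside the bounding box.</p>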
<sec>
<label>2.3.</label>
<title>Spatial Extension</title>
<p>Once we have initialized the trimap, we can apply the iterative minimization algorithm shown in steps 4 to 7 of the original GrabCut (Algorithm 1). However, instead of applying <italic>k</italic>-means for the initialization of the GMMs, we propose to use Mean Shift clustering, which also takes into account spatial coherence. Given an initial estimate of the distribution modes <italic>m<sub>h</sub></italic>(<bold>x</bold><sup>0</sup>) and a kernel function <italic>g</italic>, Mean Shift iteratively updates the mean-shift vector with the following formula:
<disp-formula id="FD5">
<label>(5)</label>
<mml:math id="mm5" display="block">
<mml:semantics id="sm5">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">m</mml:mi>
<mml:mi>h</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi></mml:msubsup>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>‖</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mi>i</mml:mi></mml:msub></mml:mrow>
<mml:mi>h</mml:mi></mml:mfrac></mml:mrow>
<mml:mo>‖</mml:mo></mml:mrow></mml:mrow>
<mml:mn>2</mml:mn></mml:msup></mml:mrow>
<mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi></mml:msubsup>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>‖</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mi>i</mml:mi></mml:msub></mml:mrow>
<mml:mi>h</mml:mi></mml:mfrac></mml:mrow>
<mml:mo>‖</mml:mo></mml:mrow></mml:mrow>
<mml:mn>2</mml:mn></mml:msup></mml:mrow>
<mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>until it converges and returns the centers of the clusters (distribution modes) found, where <bold>x</bold><italic><sub>i</sub></italic> contains the value of pixel <italic>z<sub>i</sub></italic> in CIELuv space together with its spatial coordinates. Once the GrabCut minimization converges, we obtain a segmentation <bold><italic>α</italic></bold><italic><sup>t</sup></italic> and the updated foreground and background GMMs <bold><italic>θ</italic></bold><italic><sup>t</sup></italic> at frame <italic>f<sub>t</sub></italic>, which are used for further initialization at frame <italic>f<sub>t</sub></italic><sub>+1</sub>. The result of this step is shown in <xref ref-type="fig" rid="f2-sensors-12-15376">Figure 2(c)</xref>. Finally, we refine the segmentation of frame <italic>f<sub>t</sub></italic> by eliminating false positive foreground pixels. By definition of the energy minimization scheme, GrabCut tends to find convex segmentation masks with a shorter perimeter, given that each pixel on the boundary of the segmentation mask contributes to the global cost. Therefore, in order to eliminate these background pixels (commonly found in concave regions) from the foreground segmentation, we re-initialize the trimap <italic>T</italic> as follows
<disp-formula id="FD6">
<label>(6)</label>
<mml:math id="mm6" display="block">
<mml:semantics id="sm6">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>B</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo>∪</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mi mathvariant="bold">∑</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>j</mml:mi></mml:mrow>
<mml:mi>t</mml:mi></mml:munderover>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold-italic">θ</mml:mi>
<mml:mi>k</mml:mi></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow>
<mml:mi>j</mml:mi></mml:mfrac>
<mml:mo>&gt;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mi mathvariant="bold">∑</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>j</mml:mi></mml:mrow>
<mml:mi>t</mml:mi></mml:munderover>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold-italic">θ</mml:mi>
<mml:mi>k</mml:mi></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow>
<mml:mi>j</mml:mi></mml:mfrac></mml:mrow>
<mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>δ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">skin</mml:mtext></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>U</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>}</mml:mo>
<mml:mo>\</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>B</mml:mi></mml:msub>
<mml:mo>\</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:semantics></mml:math></disp-formula>where the background membership probability of each pixel is computed using the GMMs of the previous <italic>j</italic> segmentations. This formulation can also be extended to detect false negatives. However, in our case we focus on false positives since they appear frequently in human segmentation. The result of this step is shown in <xref ref-type="fig" rid="f2-sensors-12-15376">Figure 2(d)</xref>. Once the trimap has been redefined, false positive foreground pixels still remain, so the new set of seeds is used to run the GrabCut algorithm again, resulting in a more accurate segmentation, as we can see in <xref ref-type="fig" rid="f2-sensors-12-15376">Figure 2(e)</xref>.</p></sec>
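<p>The trimap re-initialization of Equation (6) can be sketched as follows, as an illustration rather than our implementation. We assume scalar pixel values and a single Gaussian (mean, variance) per label as a stand-in for each stored GMM. Equation (6) does not state which set takes precedence when a skin-colored pixel also satisfies the background condition, so this sketch assigns it to <italic>T<sub>B</sub></italic>, which is an assumption. All names are hypothetical.</p>
<preformat><![CDATA[
```python
import math

def gaussian_pdf(value, mean, var):
    """Density of a 1D Gaussian, standing in for a GMM likelihood."""
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def refine_trimap(z, alpha, history, skin=frozenset()):
    """Re-initialize the trimap following Equation (6).

    z: scalar pixel values; alpha: current labels in {0, 1}.
    history: list of (bg_model, fg_model) pairs for the stored frames, where
             each model is a (mean, variance) tuple (assumed simplification).
    skin: indices of pixels matching the skin color model.
    """
    trimap = []
    for i, (value, label) in enumerate(zip(z, alpha)):
        # Summed likelihoods over the stored models; the common 1/j factor
        # of Equation (6) cancels in the comparison.
        p_bg = sum(gaussian_pdf(value, *bg) for bg, _ in history)
        p_fg = sum(gaussian_pdf(value, *fg) for _, fg in history)
        if label == 0 or p_bg > p_fg:
            trimap.append('B')   # T_B: background label or background-like
        elif i in skin:
            trimap.append('F')   # T_F: skin-colored foreground seeds
        else:
            trimap.append('U')   # T_U: remaining foreground pixels
    return trimap
```
]]></preformat>
<p>The resulting trimap can then seed a further GrabCut run, so that foreground pixels that consistently resemble the recent background models are clamped to background.</p>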
<sec>
<label>2.4.</label>
<title>Temporal Extension</title>
<p>Considering <italic>A</italic> as the binary image representing <bold><italic>α</italic></bold> at <italic>f<sub>t</sub></italic> (the one obtained before the refinement), we initialize the trimap for <italic>f<sub>t</sub></italic><sub>+1</sub> as follows
<disp-formula id="FD7">
<label>(7)</label>
<mml:math id="mm7" display="block">
<mml:semantics id="sm7">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>⊖</mml:mo>
<mml:mi>S</mml:mi>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>e</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>U</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>⊕</mml:mo>
<mml:mi>S</mml:mi>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>d</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>}</mml:mo>
<mml:mo>\</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>B</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>=</mml:mo></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>I</mml:mi>
<mml:mo>}</mml:mo>
<mml:mo>\</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi></mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>U</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:semantics></mml:math></disp-formula>where ⊖ and ⊕ are erosion and dilation operations with their respective structuring elements <italic>ST<sub>e</sub></italic> and <italic>ST<sub>d</sub></italic>, <italic>α<sub>i</sub></italic> := <italic>α</italic>(<italic>z<sub>i</sub></italic>), and \ represents the set difference operation. The structuring elements are simple squares whose size depends on the size of the person and on the degree of movement we allow from <italic>f<sub>t</sub></italic> to <italic>f<sub>t</sub></italic><sub>+1</sub>, assuming smoothness in the movement of the person. An example of a morphological mask is shown in <xref ref-type="fig" rid="f2-sensors-12-15376">Figure 2(f)</xref>. Spatial information could also be included in the mean-shift algorithm in conjunction with color and spatial information. However, we included this information explicitly to be anisotropic. The whole segmentation methodology is detailed in Algorithm 2 (ST-GrabCut).
<array>
<tbody>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Algorithm 2 Spatio-Temporal GrabCut algorithm.</bold></td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td align="right" valign="top"> 1:</td>
<td align="left" valign="top">Person detection on <italic>f</italic><sub>1</sub>.</td></tr>
<tr>
<td align="right" valign="top"> 2:</td>
<td align="left" valign="top">Face detection and skin color model learning.</td></tr>
<tr>
<td align="right" valign="top"> 3:</td>
<td align="left" valign="top">Trimap <italic>T</italic> initialization with detected bounding box and learnt skin color model.</td></tr>
<tr>
<td align="right" valign="top"> 4:</td>
<td align="left" valign="top">Initialize <italic>α<sub>i</sub></italic> = 0 for <italic>i</italic> ∈ <italic>T<sub>B</sub></italic> and <italic>α<sub>i</sub></italic> = 1 for <italic>i</italic> ∈ <italic>T<sub>U</sub></italic> ∪ <italic>T<sub>F</sub></italic>.</td></tr>
<tr>
<td align="right" valign="top"> 5:</td>
<td align="left" valign="top">Initialize Background and Foreground GMMs from sets <italic>α<sub>i</sub></italic> = 0 and <italic>α<sub>i</sub></italic> = 1 respectively, with Mean-shift.</td></tr>
<tr>
<td align="right" valign="top"> 6:</td>
<td align="left" valign="top"><bold>for</bold> <italic>t</italic> = 1 … <italic>M</italic></td></tr>
<tr>
<td align="right" valign="top"> 7:</td>
<td align="left" valign="top"> Person detection on <italic>f<sub>t</sub></italic>.</td></tr>
<tr>
<td align="right" valign="top"> 8:</td>
<td align="left" valign="top"> Assign GMM components to pixels of <italic>f<sub>t</sub></italic>.</td></tr>
<tr>
<td align="right" valign="top"> 9:</td>
<td align="left" valign="top"> Learn GMM parameters from data z.</td></tr>
<tr>
<td align="right" valign="top"> 10:</td>
<td align="left" valign="top"> Estimate segmentation: Graph-cuts.</td></tr>
<tr>
<td align="right" valign="top"> 11:</td>
<td align="left" valign="top"> Repeat from step 8, until convergence.</td></tr>
<tr>
<td align="right" valign="top"> 12:</td>
<td align="left" valign="top"> Re-initialize trimap <italic>T</italic> (<xref rid="FD6" ref-type="disp-formula">Equation (6)</xref>).</td></tr>
<tr>
<td align="right" valign="top"> 13:</td>
<td align="left" valign="top"> Assign GMM components to pixels.</td></tr>
<tr>
<td align="right" valign="top"> 14:</td>
<td align="left" valign="top"> Learn GMM parameters from data z.</td></tr>
<tr>
<td align="right" valign="top"> 15:</td>
<td align="left" valign="top"> Estimate segmentation: Graph-cuts.</td></tr>
<tr>
<td align="right" valign="top"> 16:</td>
<td align="left" valign="top"> Repeat from step 12, until convergence.</td></tr>
<tr>
<td align="right" valign="top"> 17:</td>
<td align="left" valign="top"> Initialize trimap <italic>T</italic> using segmentation obtained in step 11 after convergence (<xref rid="FD7" ref-type="disp-formula">equation 7</xref>) for <italic>f<sub>t</sub></italic><sub>+1</sub>.</td></tr>
<tr>
<td align="right" valign="top"> 18:</td>
<td align="left" valign="top"><bold>end for</bold></td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr></tbody></array></p></sec>
<sec>
<label>2.5.</label>
<title>Face Fitting</title>
<p>Once we have properly segmented the body region, the next step consists of fitting the face and the body limbs. For face recovery, we base our procedure on mesh fitting using Active Appearance Models (AAM), which combine Active Shape Models with color and texture information [<xref ref-type="bibr" rid="b13-sensors-12-15376">13</xref>].</p>
<p>An AAM is generated by combining models of shape and texture variation. First, a set of points is marked on the faces of the training images, the point sets are aligned, and a statistical shape model is built [<xref ref-type="bibr" rid="b14-sensors-12-15376">14</xref>]. Each training image is warped so that its points match those of the mean shape. The warped image is raster scanned into a texture vector, <bold>g</bold>, which is normalized by applying a linear transformation, <bold>g</bold> ↦ (<bold>g</bold> − <italic>μ<sub>g</sub></italic><bold>1</bold>)/<italic>σ<sub>g</sub></italic>, where <bold>1</bold> is a vector of ones, and <italic>μ<sub>g</sub></italic> and <italic>σ<sub>g</sub></italic> are the mean and variance of elements of <bold>g</bold>. After normalization, <bold>g</bold><italic><sup>T</sup></italic><bold>1</bold> = 0 and |<bold>g</bold>| = 1. Then, principal component analysis is applied to build a texture model. Finally, the correlations between shape and texture are learnt to generate a combined appearance model. The appearance model has a parameter vector <bold>c</bold> controlling the shape and texture according to
<disp-formula id="FD8">
<label>(8)</label>
<mml:math id="mm8" display="block">
<mml:semantics id="sm8">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>=</mml:mo>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>¯</mml:mo></mml:mover>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>s</mml:mi></mml:msub>
<mml:mi mathvariant="bold">c</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>
<disp-formula id="FD9">
<label>(9)</label>
<mml:math id="mm9" display="block">
<mml:semantics id="sm9">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mo>=</mml:mo>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo>¯</mml:mo></mml:mover>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>g</mml:mi></mml:msub>
<mml:mi mathvariant="bold">c</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>x̄</italic> is the mean shape, <italic>ḡ</italic> the mean texture in a mean shaped patch, and <bold>Q</bold><italic><sub>s</sub></italic>, <bold>Q</bold><italic><sub>g</sub></italic> are matrices describing the modes of variation derived from the training set. A shape <bold>X</bold> in the image frame can be generated by applying a suitable transformation to the points <bold>x</bold>: <bold>X</bold> = <italic>S<sub>t</sub></italic>(<bold>x</bold>). Typically, <italic>S<sub>t</sub></italic> will be a similarity transformation described by a scaling <italic>s</italic>, an in-plane rotation, <italic>θ</italic>, and a translation (<italic>t<sub>x</sub></italic>, <italic>t<sub>y</sub></italic>).</p>
<p>Once the AAM is constructed, it is deformed on the image to detect and segment the face appearance as follows. During matching, we sample the pixels in the region of interest, <bold><italic>g</italic></bold><italic><sub>im</sub></italic> = <italic>T<sub>u</sub></italic>(<bold>g</bold>) = (<italic>u</italic><sub>1</sub> + 1)<bold>g</bold> + <italic>u</italic><sub>2</sub><bold>1</bold>, where <bold>u</bold> is the vector of transformation parameters, and project into the texture model frame, 
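<p>The texture normalization and the linear model of Equations (8) and (9) can be illustrated with a minimal sketch. The function names <monospace>normalize_texture</monospace> and <monospace>synthesize</monospace>, the toy values, and the choice of dividing by the length of the centred vector (so that the stated properties of zero sum and unit length hold) are assumptions made for illustration only, not the implementation used in this work.</p>
<preformat>
```python
import math


def normalize_texture(g):
    """Shift a texture vector to zero mean and scale it to unit length."""
    mean = sum(g) / len(g)
    centered = [v - mean for v in g]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        raise ValueError("texture vector is constant")
    return [v / norm for v in centered]


def synthesize(mean_vec, modes, c):
    """Evaluate mean + Q c, with the matrix Q stored as a list of rows."""
    return [m + sum(q * ck for q, ck in zip(row, c))
            for m, row in zip(mean_vec, modes)]


# Toy data (hypothetical): a 4-pixel texture and a 2-D "shape" with two modes.
g = normalize_texture([10.0, 20.0, 30.0, 40.0])
x = synthesize([0.0, 1.0], [[1.0, 0.0], [0.0, 2.0]], [0.5, -0.25])
```
</preformat>
<p>The same <monospace>synthesize</monospace> call serves for both the shape and the texture, since both are written as a mean plus a matrix of modes applied to the shared parameter vector <bold>c</bold>.</p>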
<inline-formula>
<mml:math id="mm10" display="inline">
<mml:semantics id="sm10">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi>T</mml:mi>
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></inline-formula>. The current model texture is given by <bold>g</bold><italic><sub>m</sub></italic> = <italic>ḡ</italic> + <bold>Q</bold><italic><sub>g</sub></italic><bold>c</bold>, and the difference between model and image (measured in the normalized texture frame) is as follows
<disp-formula id="FD10">
<label>(10)</label>
<mml:math id="mm11" display="block">
<mml:semantics id="sm11">
<mml:mrow>
<mml:mi mathvariant="bold">r</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">g</mml:mi>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">g</mml:mi>
<mml:mi>m</mml:mi></mml:msub></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>Given the error <italic>E</italic> = |<bold>r</bold>|<sup>2</sup>, we compute the predicted displacements <italic>δ</italic><bold>p</bold> = −<bold>Rr</bold>(<bold>p</bold>), where 
<inline-formula>
<mml:math id="mm12" display="inline">
<mml:semantics id="sm12">
<mml:mrow>
<mml:mi mathvariant="bold">R</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">r</mml:mi>
<mml:mi>T</mml:mi></mml:msup></mml:mrow>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi></mml:mrow></mml:mfrac>
<mml:mfrac>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:mi mathvariant="bold">r</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi></mml:mrow></mml:mfrac></mml:mrow>
<mml:mo>)</mml:mo></mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msup>
<mml:mfrac>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">r</mml:mi>
<mml:mi>T</mml:mi></mml:msup></mml:mrow>
<mml:mrow>
<mml:mo>∂</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></inline-formula>. The model parameters are updated <bold>p</bold> ↦ <bold>p</bold> + <italic>kδ</italic><bold>p</bold>, where initially <italic>k</italic> = 1. The new points <bold>X</bold>′ and model frame texture 
<inline-formula>
<mml:math id="mm13" display="inline">
<mml:semantics id="sm13">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">g</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo>′</mml:mo></mml:msubsup></mml:mrow></mml:semantics></mml:math></inline-formula> are estimated, and the image is sampled at the new points to obtain 
<inline-formula>
<mml:math id="mm14" display="inline">
<mml:semantics id="sm14">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">g</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi></mml:mrow>
<mml:mo>′</mml:mo></mml:msubsup></mml:mrow></mml:semantics></mml:math></inline-formula> and the new error vector 
<inline-formula>
<mml:math id="mm15" display="inline">
<mml:semantics id="sm15">
<mml:mrow>
<mml:mi mathvariant="bold">r</mml:mi>
<mml:mo>′</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>′</mml:mo></mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi></mml:mrow>
<mml:mo>′</mml:mo></mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mi>g</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo>′</mml:mo></mml:msubsup></mml:mrow></mml:semantics></mml:math></inline-formula>. A final condition guides the end of each iteration: if |<bold>r</bold>′|<sup>2</sup> &lt; <italic>E</italic>, then we accept the new estimate; otherwise, we try <italic>k</italic> = 0.5, <italic>k</italic> = 0.25, and so on. The procedure is repeated until no improvement is made to the error.</p>
<p>To discretize the head pose between frontal and profile faces, we create three AAM models corresponding to the frontal, right, and left views. Aligning every mesh of a model, we obtain the mean of that model. Finally, the class of a face fitted by the AAM models is given by the closest mean model.</p>
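<p>The step-halving search described above can be sketched as follows. This is a minimal illustration under stated assumptions: <monospace>predict_fn</monospace> stands in for the predicted displacement −<bold>Rr</bold>(<bold>p</bold>), <monospace>error_fn</monospace> for the error <italic>E</italic>, and the limit of five halvings as well as the toy quadratic error are hypothetical choices not taken from this work.</p>
<preformat>
```python
def damped_update(p, delta_p, error_fn, max_halvings=5):
    """Try p + k * delta_p for k = 1, 0.5, 0.25, ... and keep the first
    candidate whose error is lower than the current error."""
    current_error = error_fn(p)
    k = 1.0
    for _ in range(max_halvings):
        candidate = [pi + k * di for pi, di in zip(p, delta_p)]
        candidate_error = error_fn(candidate)
        if current_error > candidate_error:
            return candidate, candidate_error, True
        k *= 0.5
    return p, current_error, False


def fit(p, predict_fn, error_fn, max_iters=50):
    """Repeat the damped update until no step improves the error."""
    for _ in range(max_iters):
        p, _, improved = damped_update(p, predict_fn(p), error_fn)
        if not improved:
            break
    return p, error_fn(p)


# Toy problem (hypothetical): quadratic error with an overshooting predictor,
# so that the full step k = 1 is rejected and k = 0.5 is accepted.
target = [1.0, -2.0]


def toy_error(p):
    return sum((pi - ti) ** 2 for pi, ti in zip(p, target))


def toy_predict(p):
    return [3.0 * (ti - pi) for pi, ti in zip(p, target)]


p_fit, e_fit = fit([0.0, 0.0], toy_predict, toy_error)
```
</preformat>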
<p>Taking into account the discontinuity that appears when a face moves from frontal to profile view, we use three different AAM corresponding to three meshes of 21 points: frontal view ℑ<italic><sub>F</sub></italic>, right lateral view ℑ<italic><sub>R</sub></italic>, and left lateral view ℑ<italic><sub>L</sub></italic>. In order to include temporal and spatial coherence, meshes at frame <italic>f<sub>t</sub></italic><sub>+1</sub> are initialized by the fitted mesh points at frame <italic>f<sub>t</sub></italic>. Additionally, we include a temporal change-mesh control procedure, as follows
<disp-formula id="FD11">
<label>(11)</label>
<mml:math id="mm16" display="block">
<mml:semantics id="sm16">
<mml:mrow>
<mml:msup>
<mml:mi>ℑ</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo>arg min</mml:mo></mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>ℑ</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msub>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>ℑ</mml:mi>
<mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mi>ℑ</mml:mi>
<mml:mi>R</mml:mi></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mi>ℑ</mml:mi>
<mml:mi>L</mml:mi></mml:mrow></mml:msub>
<mml:mo>}</mml:mo>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>ℑ</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msup>
<mml:mo>∈</mml:mo>
<mml:mi>ν</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>ℑ</mml:mi>
<mml:mi>t</mml:mi></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>ν</italic>(ℑ<italic><sup>t</sup></italic>) corresponds to the meshes contiguous to the mesh ℑ<italic><sup>t</sup></italic> fitted at time <italic>t</italic> (including the same mesh), and <italic>E</italic><sub>ℑ</sub><italic><sub>i</sub></italic> is the fitting error cost of mesh ℑ<italic><sub>i</sub></italic>. This constraint avoids false jumps and imposes smoothness in the temporal face behavior (e.g., a jump from right to left profile view is not allowed).</p>
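<p>The temporal change-mesh control of Equation (11) amounts to choosing, among the meshes contiguous to the previous one, the mesh with the lowest fitting error. A minimal sketch follows. The view names, the <monospace>NEIGHBOURS</monospace> table (the frontal view is assumed contiguous to both profiles, and the profiles only to the frontal view), and the error values are illustrative assumptions.</p>
<preformat>
```python
# Views contiguous to each view, including the view itself. A direct jump
# between the right and left profiles is excluded, as stated in the text.
NEIGHBOURS = {
    "frontal": ("frontal", "right", "left"),
    "right": ("right", "frontal"),
    "left": ("left", "frontal"),
}


def next_mesh(previous_view, fitting_errors):
    """Return the allowed view with the lowest fitting error."""
    allowed = NEIGHBOURS[previous_view]
    return min(allowed, key=lambda view: fitting_errors[view])


# Hypothetical fitting errors for one frame: the left mesh fits best, but it
# is not contiguous to the right mesh fitted at the previous frame.
errors = {"frontal": 0.40, "right": 0.55, "left": 0.10}
choice_from_right = next_mesh("right", errors)
choice_from_frontal = next_mesh("frontal", errors)
```
</preformat>
<p>With these values, the mesh selected after a right view is the frontal one, whereas after a frontal view the left mesh can be selected directly.</p>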
<p>In order to obtain a more accurate pose estimation, after fitting the mesh, we take advantage of its variability to differentiate among a set of head poses. Analyzing the spatial configuration of the 21 landmarks that compose a mesh, we create a new training set divided into five classes. We define five different head poses as follows: right, middle-right, frontal, middle-left, and left. In the training process, every mesh is aligned, and PCA is applied to keep the 20 most representative eigenvectors. Then, a new image is projected onto that new space and classified into one of the five head poses according to a 3-Nearest Neighbor rule.</p>
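<p>The 3-Nearest Neighbor rule applied in the projected space can be sketched as below. The sketch assumes that the PCA projection has already been computed; the two-dimensional feature vectors (instead of 20 components), the Euclidean distance, and the function name <monospace>knn_classify</monospace> are assumptions made for illustration.</p>
<preformat>
```python
from collections import Counter
import math


def knn_classify(sample, training_set, k=3):
    """Majority vote among the k training vectors closest to the sample."""
    ranked = sorted(training_set, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical projected meshes with their head pose labels.
training = [
    ([0.0, 0.0], "frontal"),
    ([0.1, 0.0], "frontal"),
    ([1.0, 1.0], "left"),
    ([1.1, 0.9], "left"),
    ([-1.0, 1.0], "right"),
]
pose_a = knn_classify([0.2, 0.1], training)
pose_b = knn_classify([1.0, 0.9], training)
```
</preformat>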
<p><xref ref-type="fig" rid="f3-sensors-12-15376">Figure 3</xref> shows examples of the AAM model fitting and pose estimation in images (obtained from [<xref ref-type="bibr" rid="b15-sensors-12-15376">15</xref>]) for the five different head poses.</p></sec>
<sec>
<label>2.6.</label>
<title>Pose Recovery</title>
<p>Considering the refined segmented body region obtained using the proposed ST-GrabCut algorithm, we construct a pictorial structure model [<xref ref-type="bibr" rid="b16-sensors-12-15376">16</xref>]. We use the method of Ramanan [<xref ref-type="bibr" rid="b6-sensors-12-15376">6</xref>,<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>], which captures the appearance and spatial configuration of body parts. A person's body parts are tied together in a tree-structured conditional random field. Parts, <italic>l<sub>i</sub></italic>, are oriented patches of fixed size, and their position is parameterized by location (<italic>x</italic>, <italic>y</italic>) and orientation <italic>ϕ</italic>. The posterior of a configuration of parts <italic>L</italic> = {<italic>l<sub>i</sub></italic>} given a frame <italic>f<sub>t</sub></italic> is
<disp-formula id="FD12">
<label>(12)</label>
<mml:math id="mm17" display="block">
<mml:semantics id="sm17">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∝</mml:mo>
<mml:mo>exp</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∈</mml:mo>
<mml:mi>E</mml:mi></mml:mrow></mml:munder>
<mml:mrow>
<mml:mi mathvariant="normal">Ψ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo>+</mml:mo>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mi>i</mml:mi></mml:munder>
<mml:mrow>
<mml:mi mathvariant="normal">Φ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow>
<mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>The pair-wise potential Ψ(<italic>l<sub>i</sub></italic>, <italic>l<sub>j</sub></italic>) corresponds to a spatial prior on the relative position of parts and embeds the kinematic constraints. The unary potential Φ(<italic>l<sub>i</sub></italic>∣<italic>f<sub>t</sub></italic>) corresponds to the local image evidence for a part in a particular position. Inference is performed over a tree-structured conditional random field.</p>
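<p>To make the structure of Equation (12) concrete, the following sketch evaluates the exponent of the posterior (the unnormalized log-posterior) for one configuration of parts. The potentials used here, a negative squared distance between connected parts and a table of evidence scores, are hypothetical placeholders and do not correspond to the potentials of the method of Ramanan.</p>
<preformat>
```python
def log_posterior(parts, edges, pairwise, unary):
    """Sum of pairwise terms over the tree edges plus unary terms over parts."""
    pair_term = sum(pairwise(parts[i], parts[j]) for i, j in edges)
    unary_term = sum(unary(name, state) for name, state in parts.items())
    return pair_term + unary_term


# Each part state is (x, y, phi). All values below are hypothetical.
parts = {"torso": (0.0, 0.0, 0.0), "head": (0.0, 1.0, 0.0)}
edges = [("torso", "head")]
evidence = {"torso": 2.0, "head": 0.5}


def toy_pairwise(a, b):
    # Penalize the squared distance between connected parts.
    return -((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def toy_unary(name, state):
    # Look up a fixed image-evidence score for the part.
    return evidence[name]


score = log_posterior(parts, edges, toy_pairwise, toy_unary)
```
</preformat>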
<p>Since the appearance of the parts is initially unknown, a first inference uses only edge features in Φ. This delivers soft estimates of body part positions, which are used to build appearance models of the parts and background (color histograms). Inference is then repeated with Φ using both edges and appearance. This parsing technique simultaneously estimates pose and appearance of parts. For each body part, parsing delivers a posterior marginal distribution over location and orientation (<italic>x</italic>, <italic>y</italic>, <italic>ϕ</italic>) [<xref ref-type="bibr" rid="b6-sensors-12-15376">6</xref>,<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>].</p></sec></sec>
<sec sec-type="results">
<label>3.</label>
<title>Results</title>
<p>Before presenting the results, we discuss the data, the methods and parameters used in the comparison, and the validation measurements.</p>
<sec>
<title>Data</title>
<p>We use the public image sequences of the Chroma Video Segmentation Ground Truth (cVSG) [<xref ref-type="bibr" rid="b17-sensors-12-15376">17</xref>], a corpus of video sequences and segmentation masks of people. Chroma-based techniques were used to record foregrounds and backgrounds separately, which were later combined to obtain the final video sequences and accurate segmentation masks almost automatically. Some samples of the sequence we have used for testing are shown in <xref ref-type="fig" rid="f4-sensors-12-15376">Figure 4(a)</xref>. The sequence has a total of 307 frames. This image sequence includes several critical factors that make segmentation difficult: object textural complexity, object structure, uncovered extent, object size, foreground and background velocity, shadows, background textural complexity, background multimodality, and small camera motion.</p>
<p>As a second database, we have also used a set of 30 videos corresponding to defenses of undergraduate theses at the University of Barcelona to test the methodology in a different environment (UBDataset). Some samples of this dataset are shown in <xref ref-type="fig" rid="f4-sensors-12-15376">Figure 4(b)</xref>.</p>
<p>Moreover, we present the Human Limb dataset, a new dataset composed of 227 images from 25 different people. In each image, 14 different limbs are labeled (see <xref ref-type="fig" rid="f4-sensors-12-15376">Figure 4(c)</xref>), including the “do not care” label between adjacent limbs, as described in <xref ref-type="fig" rid="f5-sensors-12-15376">Figure 5</xref>. Backgrounds are from different real environments with different visual complexity. This dataset is useful for human segmentation, limb detection, and pose recovery purposes [<xref ref-type="bibr" rid="b18-sensors-12-15376">18</xref>].</p></sec>
<sec sec-type="methods">
<title>Methods</title>
<p>We compare the classical semi-automatic GrabCut algorithm for human segmentation with the proposed ST-GrabCut algorithm. In the case of GrabCut, we set the number of GMM components <italic>k</italic> = 5 for both foreground and background models. Furthermore, the pre-trained models used for the person and face detectors have been taken from OpenCV 2.1.</p>
<p>We also test the mesh fitting and body pose recovery methodologies on the obtained segmentations. The body model used for the pose recovery was taken directly from the work of [<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>].</p></sec>
<sec>
<title>Validation measurements</title>
<p>In order to evaluate the robustness of the methodology for human body segmentation, face and pose fitting, we use the ground truth masks of the images to compute the overlapping factor <italic>O</italic> as follows
<disp-formula id="FD13">
<label>(13)</label>
<mml:math id="mm18" display="block">
<mml:semantics id="sm18">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi>C</mml:mi></mml:mrow></mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow>
<mml:mrow>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi>C</mml:mi></mml:mrow></mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>M<sub>GC</sub></italic> and <italic>M<sub>GT</sub></italic> are the binary masks obtained for spatio-temporal GrabCut segmentation and the ground truth mask, respectively.</p></sec>
<sec>
<label>3.1.</label>
<title>Spatio-Temporal GrabCut Segmentation</title>
<p>First, we test the proposed ST-GrabCut segmentation on the sequence from the public cVSG corpus. The results for the different experiments are shown in <xref ref-type="table" rid="t1-sensors-12-15376">Table 1</xref>. In order to avoid the manual initialization of the classical GrabCut algorithm, in all the experiments seed initialization is performed by applying the aforementioned HOG person detection, face detection, and skin color model. The first row of <xref ref-type="table" rid="t1-sensors-12-15376">Table 1</xref> shows the overlapping performance of <xref rid="FD13" ref-type="disp-formula">Equation (13)</xref> when applying GrabCut segmentation with <italic>k</italic>-means clustering to design the GMM models. The second row shows the overlapping performance considering the spatial extension of the algorithm introduced by using Mean Shift clustering (<xref rid="FD5" ref-type="disp-formula">Equation (5)</xref>) to design the GMM models. One can see a slight improvement when using the second strategy. This is mainly because Mean Shift clustering takes into account the spatial information of pixels at clustering time, which better assigns contiguous image pixels to the GMM models of foreground and background. The third result in <xref ref-type="table" rid="t1-sensors-12-15376">Table 1</xref> shows the overlapping obtained when adding the temporal extension to the spatial one, considering the morphology refinement based on the previous segmentation (<xref rid="FD7" ref-type="disp-formula">Equation (7)</xref>). In this case, we obtain a performance improvement of nearly 10% with respect to the previous result. Finally, the last result of <xref ref-type="table" rid="t1-sensors-12-15376">Table 1</xref> shows the overlapping performance of the fully automatic ST-GrabCut segmentation, taking into account spatio-temporal coherence and the segmentation refinement introduced in <xref rid="FD6" ref-type="disp-formula">Equation (6)</xref>. 
One can see that it achieves a performance improvement of about 25% in relation to the previous best performance. Some segmentation results obtained by the GrabCut algorithm for the cVSG corpus are shown in <xref ref-type="fig" rid="f6-sensors-12-15376">Figure 6</xref>. Note that the ST-GrabCut segmentation is able to robustly segment convex regions. We have also applied the ST-GrabCut segmentation methodology to the image sequences of UBDataset. Some segmentations are shown in <xref ref-type="fig" rid="f6-sensors-12-15376">Figure 6</xref>.</p></sec>
<sec>
<label>3.2.</label>
<title>Face Fitting</title>
<p>In order to measure the robustness of the spatio-temporal AAM mesh fitting methodology, we performed the overlapping analysis of meshes in both the unsegmented and segmented image sequences of the public cVSG corpus. Overlapping results are shown in <xref ref-type="table" rid="t2-sensors-12-15376">Table 2</xref>. One can see that the mesh fitting works well in unsegmented images, obtaining a final mean overlapping of 89.60%. In this test, we apply the Haar cascade face detection implemented and trained in the Open Source Computer Vision library (OpenCV). The face detection method implemented in OpenCV by Rainer Lienhart is very similar to the one published and patented by Paul Viola and Michael Jones, known as the Viola–Jones face detection method [<xref ref-type="bibr" rid="b19-sensors-12-15376">19</xref>]. The classifier is trained with a few hundred sample views of a frontal face, scaled to the same size (20 × 20), and negative examples of the same size. However, note that by combining the temporal information of the previous fitting with the ST-GrabCut segmentation, the face mesh fitting improves considerably, reaching a final overlapping performance of 96.36%. Some examples of face fitting using the AAM meshes for different face poses of the cVSG corpus are shown in <xref ref-type="fig" rid="f7-sensors-12-15376">Figure 7</xref>.</p>
<p>To create the three AAM models that represent the frontal, right and left views, we have created a training set composed of 1,000 images for each view. The images have been extracted from the public database [<xref ref-type="bibr" rid="b15-sensors-12-15376">15</xref>]. To build the three models, we manually placed 21 landmarks on 500 images for each view. The landmarks of the remaining 500 images of each view were placed by a semi-automatic process, applying the AAM learnt from the manually labeled set and correcting the result manually. Finally, we align every resulting mesh and obtain the mean for each model. For the head pose classifier, which classifies the spatial mesh configuration into five head poses, we have manually labeled the class of the mesh obtained by applying the closest AAM model. Every spatial mesh configuration is represented by the 20 most representative eigenvectors. The training set is formed by 5,000 images from the public database [<xref ref-type="bibr" rid="b15-sensors-12-15376">15</xref>]. Finally, we have tested the classification of the five face poses on the cVSG corpus, obtaining the percentage of frames of the subject at each pose. The obtained percentages are shown in <xref ref-type="table" rid="t3-sensors-12-15376">Table 3</xref>.</p></sec>
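<p>The overlapping values reported in this analysis follow Equation (13), that is, the ratio between the intersection and the union of two binary masks. A minimal sketch is given below. The representation of masks as nested lists, the function name <monospace>overlap</monospace>, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.</p>
<preformat>
```python
def overlap(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match (assumed convention).
    return inter / union if union else 1.0


# Hypothetical 2 x 3 masks: two pixels in common, four pixels in the union.
segmentation = [[1, 1, 0],
                [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
o = overlap(segmentation, ground_truth)
```
</preformat>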
<sec>
<label>3.3.</label>
<title>Body Limbs Recovery</title>
<p>Finally, we combine the previous segmentation and face fitting with a full body pose recovery [<xref ref-type="bibr" rid="b8-sensors-12-15376">8</xref>]. In order to show the benefit of applying the previous ST-GrabCut segmentation, we compute the overlapping performance of full pose recovery with and without human segmentation, always within the bounding box obtained from HOG person detection. Results are shown in <xref ref-type="table" rid="t4-sensors-12-15376">Table 4</xref>. One can see that pose recovery considerably increases its performance when the search region is reduced based on the ST-GrabCut segmentation. Some examples of pose recovery within the human segmentation regions for the cVSG corpus and UBDataset are shown in <xref ref-type="fig" rid="f8-sensors-12-15376">Figure 8</xref>. One can see that in most cases body limbs are correctly detected. Only in some situations can occlusions or changes in body appearance produce a wrong limb fitting.</p>
<p>In <xref ref-type="fig" rid="f9-sensors-12-15376">Figure 9</xref> we show the application of the whole framework to perform temporal tracking, segmentation and full face and pose recovery. The colors correspond to the body limbs. The colors increase in intensity based on the instant of time of its detection. One can see the robust detection and temporal coherence based on the smooth displacement of face and limb detections.</p></sec>
<sec>
<label>3.4.</label>
<title>Human Limb Data Set</title>
<p>In this last experiment, we test our methodology on the presented Human Limb dataset. We grouped the 14 limb annotations into six categories: trunk, up-arms, up-legs, low-arms, low-legs, and head, and we tested the full pose recovery framework. In this case, we tested the body limb recovery with and without applying the ST-GrabCut segmentation, and computed three different overlapping measures: (1) %, which corresponds to the overlapping percentage defined in <xref rid="FD13" ref-type="disp-formula">Equation (13)</xref>; (2) wins, which corresponds to the number of limb regions with higher overlapping when comparing both strategies; (3) match, which corresponds to the number of limb recoveries with overlapping above 0.6. The results are shown in <xref ref-type="table" rid="t5-sensors-12-15376">Table 5</xref>. One can see that, because of the reduced region where the subjects appear, in most cases there is no significant difference between applying the limb recovery procedure with or without previous segmentation. Moreover, the segmentation algorithm does not work at maximum performance for the same reason: very small background regions are present in the images, and thus the background color model is quite poor. Furthermore, in this dataset we are working with images, not videos, and for this reason we cannot include the temporal extension in our ST-GrabCut algorithm for this experiment. On the other hand, looking at the mean overlapping in the last column of the table, one can see that ST-GrabCut improves the final limb overlapping for all overlapping measures. In particular, the clearest improvement from using ST-GrabCut segmentation appears in the Low-legs recovery. The part of the image corresponding to Low-legs is where the background has the most influence, and thus the limb recovery has the highest confusion. 
However, as ST-GrabCut is able to properly segment the concave regions of the Low-legs, a significant improvement is obtained when applying the limb recovery methodology. Some results are illustrated in the images of <xref ref-type="fig" rid="f10-sensors-12-15376">Figure 10</xref>, where the images on the bottom correspond to the improvements obtained using the ST-GrabCut algorithm. Finally, <xref ref-type="fig" rid="f11-sensors-12-15376">Figure 11</xref> shows examples of the face fitting methodology applied to the Human Limb dataset.</p></sec></sec>
<sec sec-type="conclusions">
<label>4.</label>
<title>Conclusions</title>
<p>In this paper, we presented an evolution of the semi-automatic GrabCut algorithm for dealing with the problem of human segmentation in image sequences. The new fully automatic ST-GrabCut algorithm uses a HOG-based person detector, face detection, and a skin color model to initialize the GrabCut seeds. Spatial coherence is introduced via Mean Shift clustering, and temporal coherence is obtained from the history of Gaussian Mixture Models. The segmentation procedure is combined with Shape and Active Appearance models to perform full face and pose recovery.</p>
<p>This general and fully automatic human segmentation, pose recovery, and tracking methodology showed higher performance than classical approaches on public image sequences and on a novel Human Limb dataset from uncontrolled environments, which makes it useful for general human face and gesture analysis applications.</p>
<p>One limitation of the method is its dependence on the initialization of the ST-GrabCut algorithm, which in turn relies on the person and face detectors. Initially, we wait until the person detector returns at least one bounding box. This is a critical point, since we trust the first detection and start segmenting under this hypothesis. In contrast, a missed detection later in the sequence causes no problem, since we initialize the mask with the previous detection (temporal extension). Moreover, because the method is applied sequentially, false seed labeling can accumulate segmentation errors along the video sequence. As the next step, we plan to extend the limb recovery approach so that more complex poses and gestures can be recognized, and to feed a gesture recognition system [<xref ref-type="bibr" rid="b20-sensors-12-15376">20</xref>] with the temporal aggregation of the recovered poses along the sequence in order to look for motion patterns of the limbs.</p>
<p>As future work, the algorithm could be extended to segment sequences with more than one person in the images, since our current method segments only one subject in the scene.</p></sec></body>
<back>
<ack>
<p>This work has been supported in part by projects IMSERSO-Ministerio de Sanidad 2011 Ref. MEDIMINDER, RECERCAIXA 2011 Ref. REMEDI, TIN2009-14404-C02 and CONSOLIDER-INGENIO CSD 2007-00018. The work of Antonio Hernández-Vela is supported by an FPU fellowship from the Spanish government.</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-sensors-12-15376"><label>1.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Dalal</surname><given-names>N.</given-names></name><name><surname>Triggs</surname><given-names>B.</given-names></name></person-group><article-title>Histograms of Oriented Gradients for Human Detection</article-title><conf-name>Proceedings of CVPR '05: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</conf-name><conf-loc>San Diego, CA, USA</conf-loc><conf-date>25 June 2005</conf-date><volume>2</volume><fpage>886</fpage><lpage>893</lpage></citation></ref>
<ref id="b2-sensors-12-15376"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Viola</surname><given-names>P.</given-names></name><name><surname>Jones</surname><given-names>M.J.</given-names></name></person-group><article-title>Robust Real-Time Face Detection</article-title><source>Int. J. Comput. Vis.</source><year>2004</year><volume>57</volume><fpage>137</fpage><lpage>154</lpage><pub-id pub-id-type="doi">10.1023/B:VISI.0000013087.49260.fb</pub-id></citation></ref>
<ref id="b3-sensors-12-15376"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geronimo</surname><given-names>D.</given-names></name><name><surname>Lopez</surname><given-names>A.</given-names></name><name><surname>Sappa</surname><given-names>A.</given-names></name></person-group><article-title>Survey of Pedestrian Detection for Advanced Driver Assistance Systems</article-title><source>IEEE Trans. Patt. Anal. Mach. Intell.</source><year>2010</year><volume>32</volume><fpage>1239</fpage><lpage>1258</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2009.122</pub-id></citation></ref>
<ref id="b4-sensors-12-15376"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rother</surname><given-names>C.</given-names></name><name><surname>Kolmogorov</surname><given-names>V.</given-names></name><name><surname>Blake</surname><given-names>A.</given-names></name></person-group><article-title>GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts</article-title><source>ACM Trans. Graph.</source><year>2004</year><volume>23</volume><fpage>309</fpage><lpage>314</lpage><pub-id pub-id-type="doi">10.1145/1015706.1015720</pub-id></citation></ref>
<ref id="b5-sensors-12-15376"><label>5.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Ferrari</surname><given-names>V.</given-names></name><name><surname>Marin-Jimenez</surname><given-names>M.</given-names></name><name><surname>Zisserman</surname><given-names>A.</given-names></name></person-group><article-title>Progressive Search Space Reduction for Human Pose Estimation</article-title><conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name><conf-loc>Anchorage, AK, USA</conf-loc><conf-date>24–26 June 2008</conf-date></citation></ref>
<ref id="b6-sensors-12-15376"><label>6.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Ferrari</surname><given-names>V.</given-names></name><name><surname>Marin</surname><given-names>M.</given-names></name><name><surname>Zisserman</surname><given-names>A.</given-names></name></person-group><article-title>Pose Search: Retrieving People Using Their Pose</article-title><conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name><conf-loc>Miami, FL, USA</conf-loc><conf-date>20–25 June 2009</conf-date></citation></ref>
<ref id="b7-sensors-12-15376"><label>7.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Corrigan</surname><given-names>D.</given-names></name><name><surname>Robinson</surname><given-names>S.</given-names></name><name><surname>Kokaram</surname><given-names>A.</given-names></name></person-group><article-title>Video Matting Using Motion Extended GrabCut</article-title><conf-name>Proceedings of 5th IET European Conference on Visual Media Production (CVMP)</conf-name><conf-loc>London, UK</conf-loc><conf-date>26–27 November 2008</conf-date></citation></ref>
<ref id="b8-sensors-12-15376"><label>8.</label><citation citation-type="web"><person-group person-group-type="author"><name><surname>Ramanan</surname><given-names>D.</given-names></name></person-group><article-title>Learning to Parse Images of Articulated Bodies</article-title><source>NIPS</source><year>2006</year><comment>Available online: <ext-link xlink:href="http://books.nips.cc/papers/files/nips19/NIPS2006_0899.pdf" ext-link-type="uri">http://books.nips.cc/papers/files/nips19/NIPS2006_0899.pdf</ext-link> (accessed on 8 November 2012)</comment></citation></ref>
<ref id="b9-sensors-12-15376"><label>9.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Boykov</surname><given-names>Y.Y.</given-names></name><name><surname>Jolly</surname><given-names>M.P.</given-names></name></person-group><article-title>Interactive Graph Cuts for Optimal Boundary &amp; Region Segmentation of Objects in N-D Images</article-title><conf-name>Proceedings of ICCV 2001: Eighth IEEE International Conference on Computer Vision</conf-name><conf-loc>Vancouver, BC, Canada</conf-loc><conf-date>7–14 July 2001</conf-date></citation></ref>
<ref id="b10-sensors-12-15376"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boykov</surname><given-names>Y.</given-names></name><name><surname>Funka-Lea</surname><given-names>G.</given-names></name></person-group><article-title>Graph Cuts and Efficient N-D Image Segmentation</article-title><source>Int. J. Comput. Vis.</source><year>2006</year><volume>70</volume><fpage>109</fpage><lpage>131</lpage><pub-id pub-id-type="doi">10.1007/s11263-006-7934-5</pub-id></citation></ref>
<ref id="b11-sensors-12-15376"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kolmogorov</surname><given-names>V.</given-names></name><name><surname>Zabih</surname><given-names>R.</given-names></name></person-group><article-title>What Energy Functions can be Minimized via Graph Cuts</article-title><source>IEEE Trans. Patt. Anal. Mach. Intell.</source><year>2004</year><volume>26</volume><fpage>65</fpage><lpage>81</lpage></citation></ref>
<ref id="b12-sensors-12-15376"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boykov</surname><given-names>Y.</given-names></name><name><surname>Kolmogorov</surname><given-names>V.</given-names></name></person-group><article-title>An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision</article-title><source>IEEE Trans. Patt. Anal. Mach. Intell.</source><year>2001</year><volume>26</volume><fpage>359</fpage><lpage>374</lpage></citation></ref>
<ref id="b13-sensors-12-15376"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cootes</surname><given-names>T.</given-names></name><name><surname>Edwards</surname><given-names>J.</given-names></name><name><surname>Taylor</surname><given-names>C.</given-names></name></person-group><article-title>Active Appearance Models</article-title><source>IEEE Trans. Patt. Anal. Mach. Intell.</source><year>2001</year><volume>23</volume><fpage>681</fpage><lpage>685</lpage><pub-id pub-id-type="doi">10.1109/34.927467</pub-id></citation></ref>
<ref id="b14-sensors-12-15376"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cootes</surname><given-names>T.</given-names></name><name><surname>Taylor</surname><given-names>C.</given-names></name><name><surname>Cooper</surname><given-names>D.</given-names></name><name><surname>Graham</surname><given-names>J.</given-names></name></person-group><article-title>Active Shape Models—Their Training and Application</article-title><source>Comput. Vis. Image Understand.</source><year>1995</year><volume>61</volume><fpage>38</fpage><lpage>59</lpage><pub-id pub-id-type="doi">10.1006/cviu.1995.1004</pub-id></citation></ref>
<ref id="b15-sensors-12-15376"><label>15.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname><given-names>G.B.</given-names></name><name><surname>Ramesh</surname><given-names>M.</given-names></name><name><surname>Berg</surname><given-names>T.</given-names></name><name><surname>Learned-Miller</surname><given-names>E.</given-names></name></person-group><source>Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments</source><comment>Technical Report 07-49</comment><publisher-name>University of Massachusetts</publisher-name><publisher-loc>Amherst, MA, USA</publisher-loc><year>2007</year></citation></ref>
<ref id="b16-sensors-12-15376"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Felzenszwalb</surname><given-names>P.</given-names></name><name><surname>Huttenlocher</surname><given-names>D.</given-names></name></person-group><article-title>Pictorial Structures for Object Recognition</article-title><source>Int. J. Comput. Vis.</source><year>2005</year><volume>61</volume><fpage>55</fpage><lpage>79</lpage><pub-id pub-id-type="doi">10.1023/B:VISI.0000042934.15159.49</pub-id></citation></ref>
<ref id="b17-sensors-12-15376"><label>17.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Tiburzi</surname><given-names>F.</given-names></name><name><surname>Escudero</surname><given-names>M.</given-names></name><name><surname>Bescos</surname><given-names>J.</given-names></name><name><surname>Martinez</surname><given-names>J.</given-names></name></person-group><article-title>A Ground-Truth for Motion-Based Video-Object Segmentation</article-title><conf-name>Proceedings of IEEE International Conference on Image Processing (Workshop on Multimedia Information Retrieval)</conf-name><conf-loc>San Diego, CA, USA</conf-loc><conf-date>12–15 October 2008</conf-date></citation></ref>
<ref id="b18-sensors-12-15376"><label>18.</label><citation citation-type="web"><person-group person-group-type="author"><collab>Human Limb dataset</collab></person-group><comment>Available online: <ext-link xlink:href="http://www.maia.ub.es/%7Esergio/linked/humanlimbdb.zip" ext-link-type="uri">http://www.maia.ub.es/%7Esergio/linked/humanlimbdb.zip</ext-link> (accessed on 8 November 2012)</comment></citation></ref>
<ref id="b19-sensors-12-15376"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Viola</surname><given-names>P.</given-names></name><name><surname>Jones</surname><given-names>M.J.</given-names></name></person-group><article-title>Robust Real-Time Face Detection</article-title><source>Int. J. Comput. Vis.</source><year>2004</year><volume>57</volume><fpage>137</fpage><lpage>154</lpage><pub-id pub-id-type="doi">10.1023/B:VISI.0000013087.49260.fb</pub-id></citation></ref>
<ref id="b20-sensors-12-15376"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alon</surname><given-names>J.</given-names></name><name><surname>Athitsos</surname><given-names>V.</given-names></name><name><surname>Yuan</surname><given-names>Q.</given-names></name><name><surname>Sclaroff</surname><given-names>S.</given-names></name></person-group><article-title>A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation</article-title><source>IEEE Trans. Patt. Anal. Mach. Intell.</source><year>2009</year><volume>31</volume><fpage>1685</fpage><lpage>1699</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2008.203</pub-id><pub-id pub-id-type="pmid">19574627</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-sensors-12-15376" position="float">
<label>Figure 1.</label>
<caption>
<p>Overall block diagram of the methodology.</p></caption>
<graphic xlink:href="sensors-12-15376f1.gif"/></fig>
<fig id="f2-sensors-12-15376" position="float">
<label>Figure 2.</label>
<caption>
<p>ST-GrabCut pipeline example: (<bold>a</bold>) Original frame, (<bold>b</bold>) Seed initialization, (<bold>c</bold>) GrabCut, (<bold>d</bold>) Probabilistic re-assignment, (<bold>e</bold>) Refinement and (<bold>f</bold>) Initialization mask for <italic>f<sub>t</sub></italic><sub>+1</sub>.</p></caption>
<graphic xlink:href="sensors-12-15376f2.gif"/></fig>
<fig id="f3-sensors-12-15376" position="float">
<label>Figure 3.</label>
<caption>
<p>From left to right: left, middle-left, frontal, middle-right and right mesh fitting.</p></caption>
<graphic xlink:href="sensors-12-15376f3.gif"/></fig>
<fig id="f4-sensors-12-15376" position="float">
<label>Figure 4.</label>
<caption>
<p>Samples of (<bold>a</bold>) the cVSG corpus, (<bold>b</bold>) the UBDataset image sequences, and (<bold>c</bold>) the Human Limb dataset.</p></caption>
<graphic xlink:href="sensors-12-15376f4.gif"/></fig>
<fig id="f5-sensors-12-15376" position="float">
<label>Figure 5.</label>
<caption>
<p>Human Limb dataset labels description.</p></caption>
<graphic xlink:href="sensors-12-15376f5.gif"/></fig>
<fig id="f6-sensors-12-15376" position="float">
<label>Figure 6.</label>
<caption>
<p>Segmentation examples of (<bold>a</bold>) UBDataset sequence 1, (<bold>b</bold>) UBDataset sequence 2 and (<bold>c</bold>) cVSG sequence.</p></caption>
<graphic xlink:href="sensors-12-15376f6.gif"/></fig>
<fig id="f7-sensors-12-15376" position="float">
<label>Figure 7.</label>
<caption>
<p>Samples of the segmented cVSG corpus image sequences fitting the different AAM meshes.</p></caption>
<graphic xlink:href="sensors-12-15376f7.gif"/></fig>
<fig id="f8-sensors-12-15376" position="float">
<label>Figure 8.</label>
<caption>
<p>Pose recovery results on a cVSG sequence.</p></caption>
<graphic xlink:href="sensors-12-15376f8.gif"/></fig>
<fig id="f9-sensors-12-15376" position="float">
<label>Figure 9.</label>
<caption>
<p>Application of the whole framework (pose and face recovery) on an image sequence.</p></caption>
<graphic xlink:href="sensors-12-15376f9.gif"/></fig>
<fig id="f10-sensors-12-15376" position="float">
<label>Figure 10.</label>
<caption>
<p>Human Limb dataset results. Top row: limb recovery without ST-GrabCut segmentation. Bottom row: limb recovery with ST-GrabCut segmentation.</p></caption>
<graphic xlink:href="sensors-12-15376f10.gif"/></fig>
<fig id="f11-sensors-12-15376" position="float">
<label>Figure 11.</label>
<caption>
<p>Application of face recovery on the Human Limb dataset.</p></caption>
<graphic xlink:href="sensors-12-15376f11.gif"/></fig>
<table-wrap id="t1-sensors-12-15376" position="float">
<label>Table 1.</label>
<caption>
<p>GrabCut and ST-GrabCut segmentation results on the cVSG corpus.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Approach</th>
<th align="center" valign="top">Mean overlapping</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">GrabCut</td>
<td align="center" valign="top">0.5356</td></tr>
<tr>
<td align="left" valign="top">Spatial extension</td>
<td align="center" valign="top">0.5424</td></tr>
<tr>
<td align="left" valign="top">Temporal extension</td>
<td align="center" valign="top">0.6229</td></tr>
<tr>
<td align="left" valign="top">ST-GrabCut</td>
<td align="center" valign="top">0.8747</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-sensors-12-15376" position="float">
<label>Table 2.</label>
<caption>
<p>AAM mesh fitting on original images and segmented images of the cVSG corpus.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Approach</th>
<th align="center" valign="top">Mean overlapping</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Mesh fitting without segmentation</td>
<td align="center" valign="top">0.8960</td></tr>
<tr>
<td align="left" valign="top">ST-GrabCut &amp; Temporal mesh fitting</td>
<td align="center" valign="top">0.9636</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-sensors-12-15376" position="float">
<label>Table 3.</label>
<caption>
<p>Face pose percentages on the cVSG corpus.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Face view</th>
<th align="center" valign="top">System classification</th>
<th align="center" valign="top">Real classification</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Left view</td>
<td align="center" valign="top">0.1300</td>
<td align="center" valign="top">0.1211</td></tr>
<tr>
<td align="left" valign="top">Near Left view</td>
<td align="center" valign="top">0.1470</td>
<td align="center" valign="top">0.1347</td></tr>
<tr>
<td align="left" valign="top">Frontal view</td>
<td align="center" valign="top">0.2940</td>
<td align="center" valign="top">0.3037</td></tr>
<tr>
<td align="left" valign="top">Near Right view</td>
<td align="center" valign="top">0.1650</td>
<td align="center" valign="top">0.1813</td></tr>
<tr>
<td align="left" valign="top">Right view</td>
<td align="center" valign="top">0.2340</td>
<td align="center" valign="top">0.2590</td></tr></tbody></table></table-wrap>
<table-wrap id="t4-sensors-12-15376" position="float">
<label>Table 4.</label>
<caption>
<p>Overlapping of body limbs based on ground truth masks.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Approach</th>
<th align="center" valign="top">Mean overlapping</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Limb recovery without segmentation</td>
<td align="center" valign="top">0.7919</td></tr>
<tr>
<td align="left" valign="top">ST-GrabCut &amp; Limb recovery</td>
<td align="center" valign="top">0.8760</td></tr></tbody></table></table-wrap>
<table-wrap id="t5-sensors-12-15376" position="float">
<label>Table 5.</label>
<caption>
<p>Overlapping percentages between body parts (intersection over union), wins (comparing the highest overlapping with and without segmentation), and matching (considering only overlapping greater than 0.6).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"/>
<th align="left" valign="top"/>
<th align="center" valign="top"><bold>Trunk</bold></th>
<th align="center" valign="top"><bold>Up-arms</bold></th>
<th align="center" valign="top"><bold>Up-legs</bold></th>
<th align="center" valign="top"><bold>Low-arms</bold></th>
<th align="center" valign="top"><bold>Low-legs</bold></th>
<th align="center" valign="top"><bold>Head</bold></th>
<th align="center" valign="top"><bold>Mean</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle" rowspan="2">%</td>
<td align="left" valign="top"><bold>No segmentation</bold></td>
<td align="center" valign="top">0.58</td>
<td align="center" valign="top">0.53</td>
<td align="center" valign="top">0.59</td>
<td align="center" valign="top">0.50</td>
<td align="center" valign="top">0.48</td>
<td align="center" valign="top">0.67</td>
<td align="center" valign="top">0.56</td></tr>
<tr>
<td align="left" valign="top"><bold>ST-GrabCut</bold><xref ref-type="table-fn" rid="tfn1-sensors-12-15376">*</xref></td>
<td align="center" valign="top">0.58</td>
<td align="center" valign="top">0.53</td>
<td align="center" valign="top">0.58</td>
<td align="center" valign="top">0.50</td>
<td align="center" valign="top">0.56</td>
<td align="center" valign="top">0.67</td>
<td align="center" valign="top"><bold>0.57</bold></td></tr>
<tr>
<td colspan="9" valign="bottom">
<hr/></td></tr>
<tr>
<td align="center" valign="middle" rowspan="2">Wins</td>
<td align="left" valign="top"><bold>No segmentation</bold></td>
<td align="center" valign="top">106</td>
<td align="center" valign="top">104</td>
<td align="center" valign="top">108</td>
<td align="center" valign="top">109</td>
<td align="center" valign="top">68</td>
<td align="center" valign="top">120</td>
<td align="center" valign="top">102.5</td></tr>
<tr>
<td align="left" valign="top"><bold>ST-GrabCut</bold><xref ref-type="table-fn" rid="tfn1-sensors-12-15376">*</xref></td>
<td align="center" valign="top">121</td>
<td align="center" valign="top">123</td>
<td align="center" valign="top">119</td>
<td align="center" valign="top">118</td>
<td align="center" valign="top">159</td>
<td align="center" valign="top">107</td>
<td align="center" valign="top"><bold>124.5</bold></td></tr>
<tr>
<td colspan="9" valign="bottom">
<hr/></td></tr>
<tr>
<td align="center" valign="middle" rowspan="2">Match</td>
<td align="left" valign="top"><bold>No segmentation</bold></td>
<td align="center" valign="top">133</td>
<td align="center" valign="top">127</td>
<td align="center" valign="top">130</td>
<td align="center" valign="top">121</td>
<td align="center" valign="top">108</td>
<td align="center" valign="top">155</td>
<td align="center" valign="top">129</td></tr>
<tr>
<td align="left" valign="top"><bold>ST-GrabCut</bold><xref ref-type="table-fn" rid="tfn1-sensors-12-15376">*</xref></td>
<td align="center" valign="top">125</td>
<td align="center" valign="top">125</td>
<td align="center" valign="top">128</td>
<td align="center" valign="top">117</td>
<td align="center" valign="top">126</td>
<td align="center" valign="top">157</td>
<td align="center" valign="top"><bold>129.66</bold></td></tr></tbody></table>
<table-wrap-foot><fn id="tfn1-sensors-12-15376">
<label>*</label>
<p>ST-GrabCut was used without taking into account temporal information.</p></fn></table-wrap-foot></table-wrap></sec></back></article>
