# An Interactive Image Segmentation Method in Hand Gesture Recognition

^{1} School of Machinery and Automation, Wuhan University of Science and Technology, Wuhan 430081, China

^{2} School of Computing, University of Portsmouth, Portsmouth PO1 3HE, UK

^{*} Author to whom correspondence should be addressed.

Received: 28 October 2016 / Accepted: 17 January 2017 / Published: 27 January 2017

(This article belongs to the Section Physical Sensors)

To improve the recognition rate of hand gestures, a new interactive image segmentation method for hand gesture recognition is presented, and popular methods, e.g., graph cut, random walker, and interactive image segmentation using geodesic star convexity, are studied in this article. A Gaussian Mixture Model is employed for image modelling, and its parameters are learned by iterating the Expectation-Maximization algorithm. We apply a Gibbs random field to the image segmentation and minimize the Gibbs energy using the min-cut theorem to find the optimal segmentation. The segmentation results of our method are tested on an image dataset and compared with other methods by estimating the region accuracy and boundary accuracy. Finally, five kinds of hand gestures in different backgrounds are tested on our experimental platform using the sparse representation algorithm, showing that segmenting hand gesture images helps to improve the recognition accuracy.

Hand gesture recognition, used as a visual input for controlling computers, is one of the most important aspects of human-computer interaction [1]. Compared with traditional input methods, such as mice, keyboards and data gloves [2,3], the use of hand gestures to control computers greatly reduces the user’s learning curve and further expands the application scenarios. To achieve hand gesture control [4], pioneers in the field have conducted a great deal of research. Sophisticated data gloves can capture every single movement of the finger joints with highly sensitive sensors [5,6] and store the hand gesture data. The hand gesture recognition process based on computer vision is illustrated in Figure 1. However, some essential problems have yet to be solved. Firstly, vision-driven hand gesture recognition is highly dependent on the sensitivity of image sensors, so relatively poor image quality hinders its development. Secondly, the image processing algorithms are not as robust as they are supposed to be: some cannot complete the segmentation correctly, while others meet the accuracy demands but require too many human interactions [7], which is inefficient in real applications.

To address the above problems, the image sensor industry has grown rapidly in recent years thanks to cutting-edge technologies. On the one hand, new kinds of image sensors, like the Microsoft Kinect 2.0 or Asus Xtion, have come onto the commercial market [8], and the innovative infrared camera [9] makes it possible to obtain depth information from image sensors. On the other hand, innovations in image processing algorithms have made them capable of segmenting hand gestures accurately, which in turn improves the accuracy of the classifiers that assign gestures to different patterns.

Image segmentation is an important stage in the whole hand gesture recognition process, and several well-known segmentation methods have been proposed to meet different image segmentation demands. For example, in the graph cut method [10], proposed by Boykov and Jolly, the main idea was to divide one image into “object” and “background”. A gray scale histogram was established to describe the distribution of gray levels, and then a cut was drawn to divide the object and background. The max-flow/min-cut algorithm was applied to minimize the energy function of a cut, and the segmentation was given by this minimum cut. These algorithms not only focus on the whole image, but also take every morphological detail into account. Random walker [11,12] is another supervised image segmentation method, where the image is viewed as an electric circuit. The edges are replaced by passive linear resistors, and the weight of each edge equals its electrical conductance. It proved to segment better than the graph cut method. Gulshan et al. [13] proposed an interactive image segmentation method that regarded shape as a powerful cue for object recognition, making the problem well posed. The use of geodesic star convexity gave it a much lower error rate than Euclidean star convexity.

In the process of hand gesture recognition [13], feature extraction is also very important. Image features such as HOG [14], Hu invariant moments [15] and Haar-like features [16] are used. In this paper, sparse representation is applied for classification and template matching, since it requires far fewer training samples. To recognise five different hand gestures, a dictionary is built from the dataset of hand gesture images. The K-SVD [17] algorithm is then adopted for sample training, and the algorithm is evaluated and compared with other methods.

To optimize the segmentation, the human visual system was carefully studied. Our eyes usually get a fuzzy picture of the whole scene at first, and then saccadic eye movements [18] help us to obtain the details of regions of interest. Inspired by the human visual system, we used the Gaussian Mixture Model (GMM) [19] to get an overall view of the color distributions of the image. Since color images are mainly represented in digital formats, with tens of thousands of pixels in one image made up of red, green and blue sub-pixels, as shown in Figure 2, an M × N × 3 array was used to store the color information of one image, where M is the horizontal resolution and N is the vertical resolution.

The single Gaussian distribution, also known as the normal distribution [20], was proposed by the French scientist de Moivre in 1733. The probability density function of a single Gaussian distribution is given by the formula:

$$p(x)=\frac{1}{\sqrt{2\pi}\sigma}\mathrm{exp}\left(-\frac{{(x-\mu )}^{2}}{2{\sigma}^{2}}\right)$$

where μ is the mathematical expectation or the mean, σ is the standard deviation of the Gaussian distribution (so that σ^{2} is its variance), and exp denotes the exponential function. For convenience, the single Gaussian distribution is usually denoted as:

$$X\sim N(\mu ,{\sigma}^{2})$$

The single Gaussian distribution formula is capable of dealing with gray scale pictures, because the variable x has only one dimension. One color image is an M × N × 3 array, so any element **x**_{i} in the dataset $\mathit{X}=\left\{{\mathit{x}}_{1},{\mathit{x}}_{2},\dots ,{\mathit{x}}_{n}\right\}$ should be at least 3-dimensional. To address this problem, the concept of the multi-dimensional Gaussian distribution is introduced. The definition of the d-dimensional Gaussian distribution is:

$$N(\mathit{x};\mathit{\mu},\mathrm{\Sigma})=\frac{1}{\sqrt{{(2\pi )}^{d}\left|\mathrm{\Sigma}\right|}}\mathrm{exp}\left[-\frac{{(\mathit{x}-\mathit{\mu})}^{T}{\mathrm{\Sigma}}^{-1}(\mathit{x}-\mathit{\mu})}{2}\right]\tag{3}$$

where **μ** is a d-dimensional vector, and for the RGB model, each component of **μ** represents the average red, green and blue color density value. $\mathrm{\Sigma}$ is the covariance matrix, $\left|\mathrm{\Sigma}\right|$ is its determinant and ${\mathrm{\Sigma}}^{-1}$ is its inverse matrix. (**x** − **μ**)^{T} is the transpose of (**x** − **μ**). To simplify Equation (3) above, θ is introduced to represent the parameters **μ** and $\mathrm{\Sigma}$, so that the probability density function of the d-dimensional Gaussian distribution can be written as:

$$p(\mathit{x})=N(\mathit{x};\theta ).$$

According to the law of large numbers, every pixel is one sample of the real scene. When the resolution is high enough, the average color density could be estimated.

In reality, the color distributions of the gesture image in Figure 2 can be represented by three histograms [21], shown in Figure 3. With independent red, green and blue distributions shown in Figure 3, we can notice that the gesture image cannot be exactly described by one single Gaussian model. But there are about five peaks in each histogram, so five single Gaussian models should be applied in gesture image modelling.

GMM is introduced to approximate the continuous probability distribution by increasing the number of single Gaussian models. The probability density function of a GMM with k mixed Gaussian models becomes:

$$p(\mathit{x})={\displaystyle \sum _{i=1}^{k}{\pi}_{i}{p}_{i}(\mathit{x};{\theta}_{i})}$$

$$p(\mathit{x})={\displaystyle \sum _{i=1}^{k}{\pi}_{i}{N}_{i}(\mathit{x};{\mathit{\mu}}_{i},{\mathrm{\Sigma}}_{i})},$$

where $i\in \left\{1,2,\dots ,k\right\}$ indexes the single Gaussian models. ${\pi}_{i}$ is the mixing coefficient of the i-th component [22], i.e., the prior probability of **x** belonging to the i-th single Gaussian model, and $\sum _{i=1}^{k}{\pi}_{i}=1$. ${p}_{i}(\mathit{x};{\theta}_{i})$ is the probability density function of the i-th single Gaussian model, parameterized by ${\mathit{\mu}}_{i}$ and ${\mathrm{\Sigma}}_{i}$ in ${N}_{i}(\mathit{x};{\mathit{\mu}}_{i},{\mathrm{\Sigma}}_{i})$. $\mathrm{\Theta}$ is introduced as the parameter set [23], {${\pi}_{1},{\pi}_{2},\dots ,{\pi}_{k},{\theta}_{1},{\theta}_{2},\dots ,{\theta}_{k}$}, to denote all ${\pi}_{i}$ and ${\theta}_{i}$.

As mentioned above, one RGB hand gesture image can be described by the dataset $\mathit{X}=\left\{{\mathit{x}}_{1},{\mathit{x}}_{2},\dots ,{\mathit{x}}_{n}\right\}$, and if we regard **X** as a sample, its probability density is:

$$p(\mathit{X};\mathrm{\Theta})={\displaystyle \prod _{j=1}^{n}p({\mathit{x}}_{j};\mathrm{\Theta})}=L(\mathrm{\Theta};\mathit{X}),\quad {\mathit{x}}_{j}\in \mathit{X},$$

where $L(\mathrm{\Theta};\mathit{X})$ is called the likelihood function of the parameters given the sample **X**. We then need a parameter set $\mathrm{\Theta}$ that completes the model. According to the maximum likelihood method [24], our next task is to find $\widehat{\mathrm{\Theta}}$, where:

$$\widehat{\mathrm{\Theta}}=\underset{\mathrm{\Theta}}{\mathrm{arg}\,\mathrm{max}}\,L(\mathrm{\Theta};\mathit{X}).$$

The functions $p(\mathit{X};\mathrm{\Theta})$ and $L(\mathrm{\Theta};\mathit{X})$ have the same form, but since we now use **X** to estimate $\mathrm{\Theta}$, $\mathrm{\Theta}$ becomes the variable and **X** the fixed parameter, so the notation $L(\mathrm{\Theta};\mathit{X})$ is used. The value of $p(\mathit{X};\mathrm{\Theta})$ is a product of n small densities and is usually too small to be represented in floating-point arithmetic, so we replace it with the log-likelihood function [25]:

$$\mathrm{ln}(L(\mathrm{\Theta};\mathit{X}))=\mathrm{ln}\left[{\displaystyle \prod _{j=1}^{n}p({\mathit{x}}_{j};\mathrm{\Theta})}\right]$$

$$={\displaystyle \sum _{j=1}^{n}\mathrm{ln}\left[{\displaystyle \sum _{i=1}^{k}{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}\right]}.$$
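The underflow that motivates the log-likelihood can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions, not the implementation used in the experiments: it uses a one-dimensional (gray scale) mixture of two components instead of RGB vectors, and the pixel values, means, standard deviations and function names are all hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_i pi_i * N(x; mu_i, sigma_i^2)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def likelihood(data, weights, mus, sigmas):
    """Direct product of the per-pixel densities; underflows for large n."""
    product = 1.0
    for x in data:
        product *= mixture_pdf(x, weights, mus, sigmas)
    return product

def log_likelihood(data, weights, mus, sigmas):
    """Sum of the logarithms of the per-pixel densities."""
    return sum(math.log(mixture_pdf(x, weights, mus, sigmas)) for x in data)

# Hypothetical gray levels: 20,000 pixels grouped around two intensities.
pixels = [50.0, 60.0, 200.0, 190.0] * 5000
weights, mus, sigmas = [0.5, 0.5], [55.0, 195.0], [20.0, 20.0]

direct = likelihood(pixels, weights, mus, sigmas)      # collapses to 0.0
logged = log_likelihood(pixels, weights, mus, sigmas)  # finite, large and negative
```

Each density here is of the order of 0.01, so the product of 20,000 of them falls below the smallest positive double and becomes exactly zero, while the sum of logarithms stays well within range.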

After establishing the Gaussian mixture model of an RGB hand gesture image, there are still several parameters that need to be estimated. The expectation-maximization (EM) algorithm [26] is introduced for the subsequent calculations. The EM algorithm is a method of acquiring the parameter set $\mathrm{\Theta}$ in the maximum likelihood method. There are two steps in this algorithm, called the E-step and M-step, respectively. To start the E-step we introduce another probability Q_{i}(**x**_{j}). It is the posterior probability associated with π_{i}, in other words, the posterior probability of each **x**_{j} from the dataset **X** belonging to the i-th single Gaussian model:

$${Q}_{i}({\mathit{x}}_{j})=\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{\displaystyle \sum _{t=1}^{k}{\pi}_{t}{p}_{t}({\mathit{x}}_{j};{\theta}_{t})}},$$

Multiplying and dividing each term of the log-likelihood by ${Q}_{i}({\mathit{x}}_{j})$ gives:

$$\mathrm{ln}(L(\mathrm{\Theta};\mathit{X}))={\displaystyle \sum _{j=1}^{n}\mathrm{ln}\left[{\displaystyle \sum _{i=1}^{k}{Q}_{i}({\mathit{x}}_{j})\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{Q}_{i}({\mathit{x}}_{j})}}\right]}\tag{12}$$

$$\ge {\displaystyle \sum _{j=1}^{n}{\displaystyle \sum _{i=1}^{k}{Q}_{i}({\mathit{x}}_{j})\mathrm{ln}\left[\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{Q}_{i}({\mathit{x}}_{j})}\right]}}.\tag{13}$$

From (12) to (13), Jensen’s inequality has been applied: since ${\mathrm{ln}}^{\prime \prime}(x)=-\frac{1}{{x}^{2}}<0$, the logarithm is concave on its domain, and the weights ${Q}_{i}({\mathit{x}}_{j})$ sum to one over i. Then, for each ${\mathit{x}}_{j}$:

$$\mathrm{ln}\left[{\displaystyle \sum _{i=1}^{k}{Q}_{i}({\mathit{x}}_{j})\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{Q}_{i}({\mathit{x}}_{j})}}\right]\ge {\displaystyle \sum _{i=1}^{k}{Q}_{i}({\mathit{x}}_{j})\mathrm{ln}\left[\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{Q}_{i}({\mathit{x}}_{j})}\right]}.$$

Maximizing the lower bound in Equation (13) with respect to $\mathrm{\Theta}$ therefore cannot decrease $\mathrm{ln}(L(\mathrm{\Theta};\mathit{X}))$. The iteration of the EM algorithm, which estimates the new parameters in terms of the old parameters, is given as follows:

- Initialization: Initialize ${\mathit{\mu}}_{i0}$ with random numbers [27], and use unit matrices as the covariance matrices ${\mathrm{\Sigma}}_{i0}$ to start the first iteration. The mixing coefficients, or prior probabilities, are assumed to be ${\pi}_{i0}=\frac{1}{k}$.
- E-step: Compute the posterior probability of ${\pi}_{i}$ using the current parameters: $${Q}_{i}({\mathit{x}}_{j}):=\frac{{\pi}_{i}{p}_{i}({\mathit{x}}_{j};{\theta}_{i})}{{\displaystyle \sum _{t=1}^{k}{\pi}_{t}{p}_{t}({\mathit{x}}_{j};{\theta}_{t})}}=\frac{{\pi}_{i}N({\mathit{x}}_{j};{\mathit{\mu}}_{i},{\mathrm{\Sigma}}_{i})}{{\displaystyle \sum _{t=1}^{k}{\pi}_{t}N({\mathit{x}}_{j};{\mathit{\mu}}_{t},{\mathrm{\Sigma}}_{t})}}$$
- M-step: Update the parameters: $${\pi}_{i}:=\frac{1}{n}{\displaystyle \sum _{j=1}^{n}{Q}_{i}({\mathit{x}}_{j})}$$ $${\mathit{\mu}}_{i}:=\frac{{\displaystyle \sum _{j=1}^{n}{Q}_{i}({\mathit{x}}_{j}){\mathit{x}}_{j}}}{{\displaystyle \sum _{j=1}^{n}{Q}_{i}({\mathit{x}}_{j})}}$$ $${\mathrm{\Sigma}}_{i}:=\frac{{\displaystyle \sum _{j=1}^{n}{Q}_{i}({\mathit{x}}_{j})({\mathit{x}}_{j}-{\mathit{\mu}}_{i}){({\mathit{x}}_{j}-{\mathit{\mu}}_{i})}^{T}}}{{\displaystyle \sum _{j=1}^{n}{Q}_{i}({\mathit{x}}_{j})}}$$

For most hand gesture images, the number of iterations is fixed in advance. To balance segmentation quality against efficiency, the number of iterations is set to 8 [28].
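The E-step and M-step above can be sketched in plain Python. This is a minimal sketch under assumptions rather than the code used in the experiments: it works on one-dimensional (gray scale) data so that the covariance matrix reduces to a scalar variance, it replaces the random initialization of the means with evenly spaced quantiles of the data so that the result is reproducible, and it evaluates the E-step in the log domain to avoid dividing by zero. The function name `em_gmm_1d`, the numerical floors and the toy data are hypothetical.

```python
import math

def em_gmm_1d(data, k, iterations=8):
    """Fit a k-component 1-D Gaussian mixture with a fixed number of EM iterations."""
    n = len(data)
    ordered = sorted(data)
    # Initialization: quantile-based means (assumption), unit variances, pi_i = 1/k.
    mus = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    variances = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior Q_i(x_j), computed from log-weights for stability.
        resp = []
        for x in data:
            logs = [math.log(max(pis[i], 1e-300))
                    - 0.5 * math.log(2.0 * math.pi * variances[i])
                    - 0.5 * (x - mus[i]) ** 2 / variances[i]
                    for i in range(k)]
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: re-estimate pi_i, mu_i and the variance from the posteriors.
        for i in range(k):
            mass = sum(r[i] for r in resp)
            pis[i] = mass / n
            if mass < 1e-12:
                continue  # an empty component keeps its previous mean and variance
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / mass
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / mass
            variances[i] = max(var, 1e-6)  # floor against a collapsing component
    return pis, mus, variances

# Toy data: two symmetric groups of gray levels centred on 0 and 10.
samples = [v / 10.0 for v in range(-20, 21)] + [10.0 + v / 10.0 for v in range(-20, 21)]
pis, mus, variances = em_gmm_1d(samples, k=2)
```

On this toy input the two components settle on the two groups, with mixing coefficients close to 0.5 each. For RGB pixels the same loop applies with vector means and 3 × 3 covariance matrices.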

The modelling method discussed previously provides a universal way of dealing with hand gesture images. To segment the digital images, a mask is introduced as shown in Figure 4, which is a binary bitmap denoted as **α**. Introducing it turns the segmentation problem into a pixel labelling problem. As α_{j} ∈ {1,0}, the value 0 labels background pixels and 1 labels foreground pixels.

To deal with the GMM tractably, we introduce two independent k-component GMMs, one for foreground modelling and one for background modelling. Each pixel **x**_{j} is marked as α_{j} = 1 if it belongs to the foreground model and as α_{j} = 0 if it belongs to the background model. The parameters of each component become: θ_{i} = {π_{i}(α_{j}), μ_{i}(α_{j}), Σ_{i}(α_{j}); α_{j} = 0,1, i = 1, …, k}.

The overall color modelling completes the first step of our analogy with the human visual system. To take every detail of the image into account, a Gibbs random field (GRF) [29] is introduced. The GRF is defined as:

$$P(\mathit{A}=\mathit{a})=\frac{1}{Z(T)}\mathrm{exp}\left(-\frac{1}{T}E(\mathit{a})\right),$$

Here, $P(\mathit{A}=\mathit{a})$ gives the probability of the system **A** being in the state **a**. T is a constant parameter, which corresponds to temperature in physics, and usually its value is 1. $Z(T)$ is the partition function:

$$Z(T)={\displaystyle \sum _{a\in A}\mathrm{exp}\left(-\frac{1}{T}E(\mathit{a})\right)},$$

where $E(\mathit{a})$ is interpreted as the energy function of the state **a**. To apply the GRF to image segmentation, the state is the labelling **α**, and the Gibbs energy [30] can be defined as follows:

$$E(\mathit{\alpha})=E(\alpha ,\mathrm{\Theta},\mathit{X})=E(\alpha ,i,\theta ,\mathit{X})=U(\alpha ,i,\theta ,\mathit{X})+V(\alpha ,\mathit{X})$$
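The Gibbs distribution assigns the highest probability to the state with the lowest energy, which is why the segmentation is later posed as an energy minimization. A minimal sketch for a small, explicitly enumerated set of states follows; the function name `gibbs_probabilities` and the three energies are hypothetical, and a real image has far too many labellings to enumerate in this way.

```python
import math

def gibbs_probabilities(energies, temperature=1.0):
    """Return P(a) = exp(-E(a)/T) / Z(T) for each state energy in `energies`."""
    lowest = min(energies)
    # Subtracting the lowest energy leaves every ratio unchanged and avoids overflow.
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    z = sum(weights)  # partition function of the shifted energies
    return [w / z for w in weights]

# Three hypothetical states with energies 1, 2 and 4 at T = 1.
probs = gibbs_probabilities([1.0, 2.0, 4.0])
```

With T = 1, two states whose energies differ by one unit have probabilities that differ by a factor of e.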

The term $U(\alpha ,i,\theta ,\mathit{X})$, also called the regional term, is defined using the GMM. It indicates the penalty of ${\mathit{x}}_{j}$ being classified as background or foreground:

$$U(\alpha ,i,\theta ,\mathit{X})={\displaystyle \sum _{j=1}^{n}-\mathrm{ln}[{p}_{i}({\mathit{x}}_{j})\times {\pi}_{i}({\alpha}_{j})]},$$

$$={\displaystyle \sum _{j=1}^{n}\left\{-\mathrm{ln}[{\pi}_{i}({\alpha}_{j})]+\frac{1}{2}\mathrm{ln}\left|{\mathrm{\Sigma}}_{i}({\alpha}_{j})\right|+\frac{1}{2}{[{\mathit{x}}_{j}-{\mathit{\mu}}_{i}({\alpha}_{j})]}^{T}{\mathrm{\Sigma}}_{i}{({\alpha}_{j})}^{-1}[{\mathit{x}}_{j}-{\mathit{\mu}}_{i}({\alpha}_{j})]\right\}}.$$

The second line expands the Gaussian density and omits the constant $\frac{d}{2}\mathrm{ln}(2\pi )$, which does not affect the minimization. The boundary term $V(\alpha ,\mathit{X})$ describes the smoothness between pixel ${\mathit{x}}_{u}$ and its neighbouring pixels ${\mathit{x}}_{v}$ in the pixel set **N**:

$$V(\alpha ,\mathit{X})=\gamma {\displaystyle \sum _{{x}_{u},{x}_{v}\in N}[{\alpha}_{u}\ne {\alpha}_{v}]}\mathrm{exp}(-\beta {\Vert {\mathit{x}}_{u}-{\mathit{x}}_{v}\Vert}^{2}),$$

where the constant γ was set to 50 by optimizing the efficiency over training. $[{\alpha}_{u}\ne {\alpha}_{v}]$ is an indicator function that takes the value 1 when the condition inside holds and 0 otherwise. β is a constant that represents the contrast of the pixel set **N** and scales the exponential term. The operator E in the equation below denotes the expectation over neighbouring pixel pairs:

$$\beta =\frac{1}{2\underset{{x}_{u},{x}_{v}\in N}{E}[{({\mathit{x}}_{u}-{\mathit{x}}_{v})}^{T}({\mathit{x}}_{u}-{\mathit{x}}_{v})]}$$
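The contrast constant β and the boundary term can be computed directly from their definitions. The sketch below is an illustration under assumptions: the image is a nested list of RGB tuples, the neighbourhood **N** is taken to be the 4-connected grid (horizontal and vertical neighbours), and the helper names and the 2 × 2 test image are hypothetical.

```python
import math

def neighbour_pairs(height, width):
    """Yield coordinate pairs of horizontally and vertically adjacent pixels."""
    for r in range(height):
        for c in range(width):
            if c + 1 < width:
                yield (r, c), (r, c + 1)
            if r + 1 < height:
                yield (r, c), (r + 1, c)

def squared_distance(p, q):
    """Squared Euclidean distance between two color vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def compute_beta(image):
    """beta = 1 / (2 * mean squared color difference over neighbouring pairs)."""
    h, w = len(image), len(image[0])
    dists = [squared_distance(image[u[0]][u[1]], image[v[0]][v[1]])
             for u, v in neighbour_pairs(h, w)]
    if not dists:
        return 0.0
    mean = sum(dists) / len(dists)
    return 1.0 / (2.0 * mean) if mean > 0 else 0.0

def boundary_energy(image, labels, gamma=50.0):
    """V = gamma * sum over neighbours with different labels of exp(-beta * ||x_u - x_v||^2)."""
    beta = compute_beta(image)
    h, w = len(image), len(image[0])
    total = 0.0
    for u, v in neighbour_pairs(h, w):
        if labels[u[0]][u[1]] != labels[v[0]][v[1]]:
            d = squared_distance(image[u[0]][u[1]], image[v[0]][v[1]])
            total += math.exp(-beta * d)
    return gamma * total

# 2 x 2 image with a color edge between the top and bottom rows.
img = [[(0, 0, 0), (0, 0, 0)], [(10, 0, 0), (10, 0, 0)]]
beta = compute_beta(img)
v_along_edge = boundary_energy(img, [[0, 0], [1, 1]])   # label change follows the color edge
v_across_flat = boundary_energy(img, [[0, 1], [0, 1]])  # label change cuts through flat regions
```

The labelling that follows the color edge costs less than the one that cuts through uniform regions, which is the behaviour the boundary term is meant to encourage.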

Until now all the constants have been defined. To begin with, all the pixels in the picture are automatically marked as undefined and labeled **U** [31]. **B** is the background seed pixel set and **O** is the foreground seed set. After the training over training set **X**, the set **O** is obtained as the segmentation result and $O\subset U$. Three pixel sets are shown in Figure 5.

To achieve the segmentation automatically, we propose an initial seed selection method for hand gesture images. Considering that human skin color has an elliptical distribution in the YCbCr color space [32], the image is transformed from the RGB color space to YCbCr using the equation below:

$$\left[\begin{array}{c}Y\\ Cb\\ Cr\end{array}\right]=\left[\begin{array}{c}16\\ 128\\ 128\end{array}\right]+\frac{1}{256}\cdot \left[\begin{array}{ccc}65.738& 129.057& 25.06\\ -37.945& -74.494& 112.43\\ 112.439& -94.154& -18.28\end{array}\right]\cdot \left[\begin{array}{c}r\\ g\\ b\end{array}\right],$$

where Y indicates the luminance. By setting $Y\in (0,80)$, the interference of highlights is overcome. Then the Cb and Cr values of human skin color are located by the elliptical equations given below:

$$\left\{\begin{array}{l}\frac{{(x-1.6)}^{2}}{{26.39}^{2}}+\frac{{(y-2.41)}^{2}}{{14.03}^{2}}<1\\ \left[\begin{array}{c}x\\ y\end{array}\right]=\left[\begin{array}{cc}\mathrm{cos}(2.53)& \mathrm{sin}(2.53)\\ -\mathrm{sin}(2.53)& \mathrm{cos}(2.53)\end{array}\right]\cdot \left[\begin{array}{c}Cb-109.38\\ Cr-152.02\end{array}\right]\end{array}\right.,$$

where x and y are intermediate variables. All the pixels satisfying the equations above are marked as foreground seeds, which belong to set **O**. We also define the pixels on the image edges as background seeds, which belong to set **B**, because the gestures are usually located far away from the edges of the images. The results of seed selection are displayed in Figure 6.
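The seed selection rule can be sketched as follows, using the coefficients exactly as printed in the two equations above. This is a sketch under assumptions and not the authors' code: the image is a nested list of 8-bit RGB tuples, a pixel is taken as a foreground seed only if its luminance lies in the stated range and its (Cb, Cr) pair lies inside the ellipse, border pixels always become background seeds, and the function names and the 3 × 3 test image are hypothetical.

```python
import math

def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to YCbCr with the matrix given in the text."""
    y = 16.0 + (65.738 * r + 129.057 * g + 25.06 * b) / 256.0
    cb = 128.0 + (-37.945 * r - 74.494 * g + 112.43 * b) / 256.0
    cr = 128.0 + (112.439 * r - 94.154 * g - 18.28 * b) / 256.0
    return y, cb, cr

def in_skin_ellipse(cb, cr):
    """Rotate (Cb, Cr) about the skin cluster centre and test the ellipse inequality."""
    theta = 2.53
    dx, dy = cb - 109.38, cr - 152.02
    x = math.cos(theta) * dx + math.sin(theta) * dy
    y = -math.sin(theta) * dx + math.cos(theta) * dy
    return (x - 1.6) ** 2 / 26.39 ** 2 + (y - 2.41) ** 2 / 14.03 ** 2 < 1.0

def select_seeds(image, y_range=(0.0, 80.0)):
    """Return (foreground_seeds, background_seeds) as sets of (row, col) coordinates."""
    h, w = len(image), len(image[0])
    foreground, background = set(), set()
    for r in range(h):
        for c in range(w):
            if r in (0, h - 1) or c in (0, w - 1):
                background.add((r, c))  # border pixels are background seeds
                continue
            y, cb, cr = rgb_to_ycbcr(*image[r][c])
            if y_range[0] < y < y_range[1] and in_skin_ellipse(cb, cr):
                foreground.add((r, c))
    return foreground, background

# 3 x 3 image: a blue border around one dark, skin-toned centre pixel.
blue, skin = (0, 0, 255), (90, 50, 30)
img = [[blue, blue, blue], [blue, skin, blue], [blue, blue, blue]]
fg_seeds, bg_seeds = select_seeds(img)
```

The luminance range is exposed as a parameter because the appropriate bounds are likely to depend on the lighting of the capture setup.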

According to the Gibbs random field, the image segmentation or pixel labelling problem is equivalent to minimizing the Gibbs energy function:

$$\underset{\{{\alpha}_{j};j\in U\}}{\mathrm{min}}\left[\underset{i}{\mathrm{min}}E(\alpha ,i,\theta ,\mathit{X})\right]$$

The min-cut/max-flow algorithm [33] is used to complete the segmentation more accurately. The idea of this algorithm is to regard one image as a network of nodes, where each node takes the place of a corresponding pixel. In addition, two extra nodes, S and T, are introduced, which represent the “source” and “sink”, respectively. Node S is linked to the pixels belonging to **O**, while T is linked to the pixels in **B**, as shown in Figure 7.

There are three kinds of links in the neighbourhood **N**, from pixel to pixel, from pixel to S and from pixel to T, denoted as $\overline{{\mathit{x}}_{u}{\mathit{x}}_{\mathit{v}}},\text{}\overline{{\mathit{x}}_{u}\mathrm{S}},\text{}\overline{{\mathit{x}}_{u}\mathrm{T}}$. Each link is assigned a certain weight, i.e., the cost [34] incurred when it is cut, as detailed in Table 1.

According to the max-flow/min-cut theorem, an optimal segmentation is defined by the minimum cut C as seen in Figure 7c. C is known as a set of $\overline{{\mathit{x}}_{u}{\mathit{x}}_{v}}$ links, so that:

$$\left|C\right|={\displaystyle \sum _{\mathit{x}\in \mathit{U}}U(C,i,\theta ,\mathit{x})}+{\displaystyle \sum _{\mathit{x}\in \mathit{N}}V(C,\mathit{x})}$$

$$=E(C,i,\theta ,\mathit{X})-\left[{\displaystyle \sum _{\mathit{x}\in \mathit{O}}U(\alpha =1,i,\theta ,\mathit{x})}+{\displaystyle \sum _{\mathit{x}\in \mathit{B}}U(\alpha =0,i,\theta ,\mathit{x})}\right]$$

The Gibbs energy can then be minimized by using the min-cut defined above. The whole segmentation process is as follows: firstly, assign a GMM component i to each ${\mathit{x}}_{j}\in \mathit{U}$ according to the user’s selection of the **U** region. Secondly, learn the parameter set $\mathrm{\Theta}$ from the whole pixel set **X**. Thirdly, use the min-cut to minimize the Gibbs energy of the whole image. Then return to the first step to start another round; after eight rounds, the final segmentation is obtained.
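The min-cut step can be illustrated on a toy graph with a source S, a sink T and two pixel nodes. The sketch below uses the Edmonds-Karp augmenting path method, which is a swapped-in textbook algorithm chosen for brevity and not necessarily the max-flow solver of [33]; the function name `min_cut`, the node names and the link weights are hypothetical.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow. Returns (cut value, set of nodes on the source side)."""
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)  # reverse edge for cancelling flow
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no path left: `parent` holds the source side of the cut
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    return flow, set(parent)

# Pixel p is strongly tied to S, pixel q to T, with a weak smoothness link between them.
links = {
    "S": {"p": 9, "q": 1},
    "p": {"T": 1, "q": 2},
    "q": {"T": 9, "p": 2},
}
cut_value, source_side = min_cut(links, "S", "T")
```

In this example the cheapest cut separates p from q, so p stays with the source (foreground) and q with the sink (background).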

To evaluate interactive segmentation quantitatively, we chose the image dataset proposed by Gulshan et al. [13], which contains 49 images from the GrabCut dataset [35], 99 images from the PASCAL VOC’09 segmentation challenge [36] and 3 images from the alpha-matting dataset [37]. These images cover a wide variety of shapes, textures and backgrounds. The corresponding ground truth images together with the initial seeds are also included in this dataset. Each initial seed map is made up of 4 manually generated brush-strokes, each 8 pixels wide: one for the foreground and 3 for the background, as shown in Figure 8.

To simulate human interactions, after the first segmentation with the initial seed map, one more seed is generated automatically in the largest connected segmentation error area (LEA). As shown in Figure 9a, the blue area is the segmentation result of the algorithm, the white area is the ground truth segmentation, and the LEA is marked in yellow. In Figure 9b, the seed is a round dot (8 pixels in diameter) generated according to the LEA. We then update the segmentation with all the seeds. This step is repeated 20 times, yielding a sequence of segmentations.

To evaluate the quality of the segmentation results, we used two different measures: the region accuracy (RA) and the boundary accuracy (BA). Each evaluation is conducted on a single segmentation, and all the images in Gulshan’s dataset are tested to verify that our proposed method is suitable for interactive image segmentation.

The RA of segmentation results is evaluated by a weighted F_{β}-measure [38]. Compared with the normal F_{β}-measure, the two terms Precision and Recall become:

$$Precisio{n}^{w}=\frac{T{P}^{w}}{T{P}^{w}+F{P}^{w}}$$

$$Recal{l}^{w}=\frac{T{P}^{w}}{T{P}^{w}+N{P}^{w}}$$

where TP denotes the overlap of the ground truth and segmented foreground pixels, FP denotes the background pixels wrongly segmented as foreground, and NP denotes the foreground pixels wrongly segmented as background.

The ${F}_{\beta}^{w}$-measure is defined as follows:

$$RA={F}_{\beta}^{w}=(1+{\beta}^{2})\frac{Precisio{n}^{w}\cdot Recal{l}^{w}}{{\beta}^{2}\cdot Precisio{n}^{w}+Recal{l}^{w}}$$

where β signifies the effectiveness of detection with respect to a user who attaches β times as much importance to Recall^{w} as to Precision^{w}; normally β = 1. We then apply the ${F}_{1}^{w}$-measure to calculate the RA of different segmentation results. The higher the RA, the better the segmentation.
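The RA formula is straightforward to evaluate once the three counts are known. The sketch below takes the counts as inputs and does not reproduce the pixel weighting scheme of [38]; the function name `region_accuracy`, its parameter names and the example counts are hypothetical.

```python
def region_accuracy(tp_w, fp_w, np_w, beta=1.0):
    """F-beta measure computed from (weighted) TP, FP and NP counts."""
    if tp_w == 0:
        return 0.0  # no correctly segmented foreground at all
    precision = tp_w / (tp_w + fp_w)
    recall = tp_w / (tp_w + np_w)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Precision = 0.8 and Recall = 0.5 give F1 = 0.8 / 1.3.
ra_example = region_accuracy(tp_w=8.0, fp_w=2.0, np_w=8.0)
ra_perfect = region_accuracy(tp_w=10.0, fp_w=0.0, np_w=0.0)
```

With β = 1 the measure is the harmonic mean of precision and recall, so a segmentation that is precise but misses half of the foreground is penalized noticeably.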

The BA [39] is defined according to the Hausdorff distance. The boundary pixels of the ground truth image and the segmented image are defined as B_{GT} and B_{SEG}, as shown in Figure 10.

The formula is as follows:

$$BA=\frac{N({B}_{SEG})+N({B}_{GT})}{{\displaystyle {\sum}_{s}\underset{g}{\mathrm{min}}(dist(s,g))}+{\displaystyle {\sum}_{g}\underset{s}{\mathrm{min}}(dist(g,s))}},$$

where g ∈ B_{GT} and s ∈ B_{SEG}, dist(**·**) denotes the Euclidean distance, and N(**·**) is the number of pixels in the set. BA is the reciprocal of the mean distance between each boundary pixel and the nearest pixel on the other boundary, so a higher value indicates more accurate boundaries.
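The BA formula can be evaluated by brute force on small boundary sets. The sketch below is a minimal illustration: boundaries are given as non-empty lists of (row, col) coordinates, the nearest neighbour search is quadratic in the number of boundary pixels, and returning infinity for identical boundaries is a convention chosen here. The function name `boundary_accuracy` and the example points are hypothetical.

```python
import math

def boundary_accuracy(b_seg, b_gt):
    """BA = (N(B_SEG) + N(B_GT)) / (sum of nearest-neighbour distances in both directions)."""
    def nearest(p, points):
        return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in points)

    total = (sum(nearest(s, b_gt) for s in b_seg)
             + sum(nearest(g, b_seg) for g in b_gt))
    if total == 0:
        return math.inf  # the two boundaries coincide exactly
    return (len(b_seg) + len(b_gt)) / total

# Two boundary pixels on each side; one pixel on each boundary is off by one.
ba_example = boundary_accuracy([(0, 0), (0, 1)], [(0, 0), (0, 2)])
ba_identical = boundary_accuracy([(0, 0), (1, 1)], [(0, 0), (1, 1)])
```

For real image boundaries a distance transform would be a more efficient way to obtain the nearest-neighbour distances.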

We segmented the images from the dataset by graph cut and random walker as shown in Figure 11. The segmentation test of our method has been made on Gulshan’s dataset as well as our hand gesture images, and some of the results using our method on hand gesture image segmentation are shown here in Figure 12.

For a more rigorous test, we tested 151 images from Gulshan’s dataset and used the human interaction simulator to perform the interactions, which generated the seeds 20 times to further refine the segmentation results. The result of each simulation step has been tested on the experiment platform. The RA and BA scores are the mean values of 151 segmentations, shown in Figure 13 and Figure 14.

From the figures above, the segmentation quality increases with simulated human interactions. When the number of seeds becomes high, a satisfactory segmentation is achieved. Our method obtains the best segmentation quality with few human interactions. Since the seeds are generated once automatically in hand image segmentation, our method is suitable for hand gesture image segmentation.

We defined five hand gestures: hand closed (HC), hand open (HO), wrist extension (WE), wrist flexion (WF), and fine pitch (FP), as shown in Figure 15.

One hundred images of each hand gesture were captured and segmented by the proposed method. We used the recognition framework in Figure 16. For each gesture, 50 images are used for training and 50 for testing. To achieve a better classification, we extract HOG features along with Hu invariant moments, weighted equally. The K-SVD dictionary training method [40] is used to choose atoms representing [41] all features and to reduce the computation costs.

We tested the recognition rates on both unsegmented hand images and segmented hand images. The recognition rates on unsegmented hand images are shown in Table 2, and the recognition rates on segmented hand images are shown in Table 3.

By segmenting the images before feature extraction, the recognition rates on those five hand gestures are increased compared with unsegmented images, according to the results in the tables above.

In conclusion, the interactive hand gesture image segmentation method can meet the segmentation demands of hand gesture images without human interaction. The mechanism behind this method is carefully explored and derived with the assistance of modern mathematical theories. Compared with other popular image segmentation methods, our method obtains better segmentation accuracy and higher quality when the number of seeds is limited. Automatic seed selection also helps to reduce human interactions. The segmentation in turn improves the recognition rate. In future work, we could adapt this method to higher resolution pictures, which requires simplifying the calculation process. In seed selection, the automatic selection method could be improved to overcome various kinds of interference, such as highlights, shadows and image distortion. Other future work will focus on improving the recognition rate by integrating the segmentation algorithm with more advanced recognition methods.

This work was supported by grants of National Natural Science Foundation of China (Grant No. 51575407, 51575338, 61273106, 51575412) and the EU Seventh Framework Programme (Grant No. 611391).

D.C. and G.L. conceived and designed the experiments; D.C. performed the experiments; D.C. and G.L. analyzed the data; D.C. contributed reagents/materials/analysis tools; D.C. wrote the paper; H.L., Z.J. and H.Y. edited the language.

The authors declare no conflict of interest.

- Nardi, B.A. Context and Consciousness: Activity Theory and Human-Computer Interaction; MIT Press: Cambridge, MA, USA, 1996; p. 400.
- Chen, D.C.; Li, G.F.; Jiang, G.Z.; Fang, Y.F.; Ju, Z.J.; Liu, H.H. Intelligent Computational Control of Multi-Fingered Dexterous Robotic Hand. J. Comput. Theor. Nanosci. **2015**, 12, 6126–6132.
- Ju, Z.J.; Zhu, X.Y.; Liu, H.H. Empirical Copula-Based Templates to Recognize Surface EMG Signals of Hand Motions. Int. J. Humanoid Robot. **2011**, 8, 725–741.
- Miao, W.; Li, G.F.; Jiang, G.Z.; Fang, Y.; Ju, Z.J.; Liu, H.H. Optimal grasp planning of multi-fingered robotic hands: A review. Appl. Comput. Math. **2015**, 14, 238–247.
- Farina, D.; Jiang, N.; Rehbaum, H.; Holobar, A.; Graimann, B.; Dietl, H.; Aszmann, O.C. The extraction of neural information from the surface EMG for the control of upper-limb prostheses: Emerging avenues and challenges. IEEE Trans. Neural Syst. Rehabil. Eng. **2014**, 22, 797–809.
- Ju, Z.; Liu, H. Human Hand Motion Analysis with Multisensory Information. IEEE/ASME Trans. Mechatron. **2014**, 19, 456–466.
- Panagiotakis, C.; Papadakis, H.; Grinias, E.; Komodakis, N.; Fragopoulou, P.; Tziritas, G. Interactive Image Segmentation Based on Synthetic Graph Coordinates. Pattern Recognit. **2013**, 46, 2940–2952.
- Yang, D.F.; Wang, S.C.; Liu, H.P.; Liu, Z.J.; Sun, F.C. Scene modeling and autonomous navigation for robots based on kinect system. Robot **2012**, 34, 581–589.
- Wang, C.; Liu, Z.; Chan, S.C. Superpixel-Based Hand Gesture Recognition with Kinect Depth Camera. IEEE Trans. Multimed. **2015**, 17, 29–39.
- Sinop, A.K.; Grady, L. A Seeded Image Segmentation Framework Unifying Graph Cuts and Random Walker Which Yields a New Algorithm. In Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8.
- Grady, L. Multilabel random walker image segmentation using prior models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; pp. 763–770.
- Couprie, C.; Grady, L.; Najman, L.; Talbot, H. Power watersheds: A new image segmentation framework extending graph cuts, random walker and optimal spanning forest. In Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, 27 September–4 October 2009; pp. 731–738.
- Gulshan, V.; Rother, C.; Criminisi, A.; Blake, A.; Zisserman, A. Geodesic star convexity for interactive image segmentation. In Proceedings of the IEEE CVPR, San Francisco, CA, USA, 13–18 June 2010; pp. 3129–3136.
- Ju, Z.; Liu, H. A Unified Fuzzy Framework for Human Hand Motion Recognition. IEEE Trans. Fuzzy Syst. **2011**, 19, 901–913.
- Xu, Y.; Yu, G.; Wang, Y.; Wu, X.; Ma, Y. A Hybrid Vehicle Detection Method Based on Viola-Jones and HOG + SVM from UAV Images. Sensors **2016**, 16, 1325.
- Fernando, M.; Wijjayanayake, J. Novel Approach to Use Hu Moments with Image Processing Techniques for Real Time Sign Language Communication. Int. J. Image Process. **2015**, 9, 335–345.
- Chen, Q.; Georganas, N.D.; Petriu, E.M. Real-time vision-based hand gesture recognition using haar-like features. In Proceedings of the IEEE Instrumentation & Measurement Technology Conference IMTC, Warsaw, Poland, 1–3 May 2007; pp. 1–6.
- Sun, R.; Wang, J.J. A Vehicle Recognition Method Based on Kernel K-SVD and Sparse Representation. Pattern Recognit. Artif. Intell. **2014**, 27, 435–442.
- Jiang, Y.V.; Won, B.-Y.; Swallow, K.M. First saccadic eye movement reveals persistent attentional guidance by implicit learning. J. Exp. Psychol. Hum. Percept. Perform. **2014**, 40, 1161–1173.
- Ju, Z.; Liu, H.; Zhu, X.; Xiong, Y. Dynamic Grasp Recognition Using Time Clustering, Gaussian Mixture Models and Hidden Markov Models. Adv. Robot. **2009**, 23, 1359–1371.
- Bian, X.; Zhang, X.; Liu, R.; Ma, L.; Fu, X. Adaptive classification of hyperspectral images using local consistency. J. Electron. Imaging **2014**, 23, 063014.
- Song, H.; Wang, Y. A spectral-spatial classification of hyperspectral images based on the algebraic multigrid method and hierarchical segmentation algorithm. Remote Sens. **2016**, 8, 296.
- Hatwar, S.; Anil, W. GMM based Image Segmentation and Analysis of Image Restoration Tecniques. Int. J. Comput. Appl. **2015**, 109, 45–50.
- Couprie, C.; Najman, L.; Talbot, H. Seeded segmentation methods for medical image analysis. In Medical Image Processing; Springer: New York, NY, USA, 2011; pp. 27–57.
- Bańbura, M.; Modugno, M. Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data. J. Appl. Econ. **2014**, 29, 133–160.
- Simonetto, A.; Leus, G. Distributed Maximum Likelihood Sensor Network Localization. IEEE Trans. Signal Process. **2013**, 62, 1424–1437.
- Ju, Z.; Liu, H. Fuzzy Gaussian Mixture Models. Pattern Recognit. **2012**, 45, 1146–1158.
- Zhang, Y.; Brady, M.; Smith, S. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging
**2001**, 20, 45–57. [Google Scholar] [CrossRef] [PubMed] - Song, W.; Cho, K.; Um, K.; Won, C.S.; Sim, S. Intuitive terrain reconstruction using height observation-based ground segmentation and 3D object boundary estimation. Sensors
**2012**, 12, 17186–17207. [Google Scholar] [CrossRef] [PubMed] - Wei, S.; Kyungeun, C.; Kyhyun, U.; Chee, S.; Sungdae, S. Complete Scene Recovery and Terrain Classification in Textured Terrain Meshes. Sensors
**2012**, 12, 11221–11237. [Google Scholar] - Liao, L.; Lin, T.; Li, B.; Zhang, W. MR brain image segmentation based on modified fuzzy C-means clustering using fuzzy GIbbs random field. J. Biomed. Eng.
**2008**, 25, 1264–1270. [Google Scholar] - Kakumanu, P.; Makrogiannis, S.; Bourbakis, N. A survey of skin-color modeling and detection methods. Pattern Recognit.
**2007**, 40, 1106–1122. [Google Scholar] [CrossRef] - Lee, G.; Lee, S.; Kim, G.; Park, J.; Park, Y. A Modified GrabCut Using a Clustering Technique to Reduce Image Noise. Symmetry
**2016**, 8, 64. [Google Scholar] [CrossRef] - Ning, J.; Zhang, L.; Zhang, D.; Wu, C. Interactive image segmentation by maximal similarity based region merging. Pattern Recognit.
**2010**, 43, 445–456. [Google Scholar] [CrossRef] - Grabcut Image Dataset. Available online: http://research.microsoft.com/enus/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm (accessed on 18 December 2016).
- Everingham, M.; Van, G.L.; Williams, C.K.; Winn, I.J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2009/ (accessed on 26 December 2016).
- Rhemann, C.; Rother, C.; Wang, J.; Gelautz, M.; Kohli, P.; Rott, P. A perceptually motivated online benchmark for image matting. In Proceedings of the CVPR, Miami, FL, USA, 20–25 June 2009; pp. 1826–1833.
- Margolin, R.; Zelnik-Manor, L.; Tal, A. How to Evaluate Foreground Maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 248–255.
- Zhao, Y.; Nie, X.; Duan, Y. A benchmark for interactive image segmentation algorithms. In Proceedings of the IEEE Person-Oriented Vision, Kona, HI, USA, 7 January 2011; pp. 33–38.
- Zhou, Y.; Liu, K.; Carrillo, R.E.; Barner, K.E.; Kiamilev, F. Kernel-based sparse representation for gesture recognition. Pattern Recognit.
**2013**, 46, 3208–3222. [Google Scholar] [CrossRef] - Yu, F.; Zhou, F. Classification of machinery vibration signals based on group sparse representation. J. Vibroeng.
**2016**, 18, 1540–1545. [Google Scholar] [CrossRef]

Link Type | Weight | Precondition |
---|---|---|
$\overline{x_u x_v}$ | $\exp(-\beta \Vert x_u - x_v \Vert^{2})$ | $x_u, x_v \in \mathit{N}$ |
$\overline{x_u \mathrm{S}}$ | $U(\alpha = 0, i, \theta, \mathit{X})$ | $x_u \in \mathit{U}$ |
$\overline{x_u \mathrm{S}}$ | $K$ | $x_u \in \mathit{O}$ |
$\overline{x_u \mathrm{S}}$ | $0$ | $x_u \in \mathit{B}$ |
$\overline{x_u \mathrm{T}}$ | $U(\alpha = 1, i, \theta, \mathit{X})$ | $x_u \in \mathit{U}$ |
$\overline{x_u \mathrm{T}}$ | $0$ | $x_u \in \mathit{O}$ |
$\overline{x_u \mathrm{T}}$ | $K$ | $x_u \in \mathit{B}$ |

where $K = 1 + \underset{x_u \in \mathit{X}}{\max} \sum_{x_u, x_v \in \mathit{N}} \exp(-\beta \Vert x_u - x_v \Vert^{2})$.
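The link-weight table can be read as a recipe for building the graph that the min-cut step operates on. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `"O"`/`"B"`/`"U"` label strings, the representation of the neighbourhood set $\mathit{N}$ as index pairs, and the `unary` callable standing in for $U(\alpha, i, \theta, \mathit{X})$ are all hypothetical choices made for illustration.

```python
import math


def n_link_weight(xu, xv, beta):
    """Neighbour-link weight exp(-beta * ||xu - xv||^2) for two pixel vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xu, xv))
    return math.exp(-beta * sq_dist)


def build_link_weights(pixels, neighbors, labels, unary, beta):
    """Assign link weights following the table.

    pixels    -- list of pixel feature tuples (e.g. colour values)
    neighbors -- list of (u, v) index pairs, an assumed encoding of the set N
    labels    -- per-pixel label: "O" (object seed), "B" (background seed),
                 or "U" (unknown); these strings are illustrative only
    unary     -- callable unary(alpha, u), a placeholder for U(alpha, i, theta, X)
    beta      -- smoothness parameter of the neighbour links
    """
    n_links = {(u, v): n_link_weight(pixels[u], pixels[v], beta)
               for u, v in neighbors}

    # K = 1 + the largest sum of neighbour-link weights attached to one pixel,
    # so a seed link is never cheaper to cut than all neighbour links of a pixel.
    incident = [0.0] * len(pixels)
    for (u, v), w in n_links.items():
        incident[u] += w
        incident[v] += w
    K = 1.0 + max(incident)

    s_links, t_links = [], []
    for u, label in enumerate(labels):
        if label == "O":        # object seed: S-link = K, T-link = 0
            s_links.append(K)
            t_links.append(0.0)
        elif label == "B":      # background seed: S-link = 0, T-link = K
            s_links.append(0.0)
            t_links.append(K)
        else:                   # unknown pixel: weights come from the data term
            s_links.append(unary(0, u))
            t_links.append(unary(1, u))
    return n_links, s_links, t_links, K


if __name__ == "__main__":
    # Three grey-level pixels in a row; the constant unary term is a dummy value.
    n_links, s_links, t_links, K = build_link_weights(
        pixels=[(0.0,), (0.0,), (1.0,)],
        neighbors=[(0, 1), (1, 2)],
        labels=["O", "U", "B"],
        unary=lambda alpha, u: 0.5,
        beta=1.0,
    )
    print(n_links, s_links, t_links, K)
```

In a real pipeline the `unary` callable would be replaced by the Gaussian Mixture Model data term, and the resulting weights would be passed to a max-flow/min-cut solver.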

Gestures | Recognition Rates |
---|---|
Hand close | 86.7% |
Hand open | 73.3% |
Wrist extension | 100% |
Wrist flexion | 100% |
Fine pitch | 66.7% |
Overall rate | 85.3% |

Gestures | Recognition Rates |
---|---|
Hand close | 93.3% |
Hand open | 100% |
Wrist extension | 100% |
Wrist flexion | 100% |
Fine pitch | 100% |
Overall rate | 98.7% |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).