1. Introduction
Hand gesture recognition, which enables visual input for controlling computers, is one of the most important aspects of human-computer interaction [1]. Compared with traditional input devices, such as mice, keyboards and data gloves [2,3], the use of hand gestures to control computers greatly reduces the user's learning curve and expands the range of application scenarios. To achieve hand gesture control [4], much pioneering research has been conducted in the field. Sophisticated data gloves can capture every single movement of the finger joints with highly sensitive sensors [5,6] and store the resulting hand gesture data. The hand gesture recognition process based on computer vision is illustrated in Figure 1. However, some essential problems have yet to be solved. Firstly, vision-driven hand gesture recognition is highly dependent on the sensitivity of image sensors, so relatively poor image quality hinders its development. Secondly, the image processing algorithms are not as robust as they are supposed to be: some cannot complete the segmentation correctly, while others meet the accuracy demands but require too many human interactions [7], which is inefficient in real applications.
To address the above problems with cutting-edge technologies, the image sensor industry has mushroomed recently. On the one hand, new kinds of image sensors, such as the Microsoft Kinect 2.0 and the Asus Xtion, have come onto the commercial market [8], and innovative infrared cameras [9] make it possible to obtain depth information from image sensors. On the other hand, innovations in image processing algorithms have made them capable of segmenting hand gestures accurately, which in turn improves the accuracy of the classifiers that assign gestures to different patterns.
Image segmentation is an important stage in the overall hand gesture recognition process, and several well-known segmentation methods have been proposed to meet different demands. For example, in the graph cut method [10] proposed by Boykov and Jolly, the main idea is to divide an image into "object" and "background". A gray-scale histogram is established to describe the distribution of gray levels, and a cut is then drawn to separate object from background. The max-flow/min-cut algorithm is applied to minimize the energy function of a cut, and the segmentation is given by this minimum cut. Such algorithms not only consider the whole image but also take every morphological detail into account. Random walker [11,12] is another supervised image segmentation method, in which the image is viewed as an electric circuit: the edges are replaced by passive linear resistors, and the weight of each edge corresponds to its electrical conductance. It has been shown to produce better segmentations than the graph cut method. Gulshan et al. [13] proposed an interactive image segmentation method that regards shape as a powerful cue for object recognition, making the problem well posed. The use of geodesic-star convexity yields a much lower error rate than Euclidean star-convexity.
In the process of hand gesture recognition [13], feature extraction is also very important. Image feature methods such as HOG [14], Hu invariant moments [15] and Haar features [16] are commonly used. In this paper, sparse representation is applied for classification and template matching, since it requires far fewer samples for training. With the intention of recognising five different hand gestures, a dictionary is built from the dataset of hand gesture images. The K-SVD algorithm [17] is then adapted for sample training, and the resulting method is evaluated and compared with other approaches.
2. Modelling of Hand Gesture Images
In order to optimize the segmentation, the human visual system was carefully studied. Our eyes usually get a fuzzy picture of the whole scene at first, and then saccadic eye movements [18] help us obtain the details of the regions of interest. Inspired by the human visual system, we use the Gaussian Mixture Model (GMM) [19] to obtain an overall view of the color distributions of the image. Since color images are mainly represented in digital formats, with tens of thousands of pixels per image, each made up of red, green and blue sub-pixels as shown in Figure 2, an M × N × 3 array is used to store the color information of one image, where M is the horizontal resolution and N is the vertical resolution.
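To make the data layout concrete, the following sketch (a minimal illustration in Python with OpenCV and NumPy, which the paper itself does not prescribe; the file name is hypothetical) loads a color image and flattens it into the MN × 3 pixel dataset used for the modelling below.

```python
import cv2
import numpy as np

# Load a color image; OpenCV returns an M x N x 3 array in BGR order.
img = cv2.imread("gesture.png")             # hypothetical file name
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # reorder channels to RGB

M, N, _ = img.shape
# Flatten the image into a dataset X of MN pixels, each a 3-D sample.
X = img.reshape(M * N, 3).astype(np.float64)
print(X.shape)  # (M*N, 3)
```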
2.1. Single Gaussian Model
The single Gaussian distribution, also known as the normal distribution [20], was introduced by the French mathematician de Moivre in 1733. The probability density function of a single Gaussian distribution is given by the formula:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)  (1)

where μ is the mathematical expectation, or mean, σ² is the variance of the Gaussian distribution, and exp denotes the exponential function. For convenience, the single Gaussian distribution is usually denoted as:

x \sim N(\mu, \sigma^2)  (2)
The single Gaussian distribution formula is capable of dealing with gray-scale pictures, because the variable x has only one dimension. A color image, however, is an M × N × 3 array, so each element x_j in the pixel dataset X = {x_1, x_2, ..., x_{MN}} is at least 3-dimensional. To address this problem, the multi-dimensional Gaussian distribution is introduced. The definition of the d-dimensional Gaussian distribution is:

f(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)  (3)

where μ is a d-dimensional vector; for the RGB model, each component of μ represents the average red, green or blue color density value. Σ is the covariance matrix, Σ^{-1} is its inverse, and (x − μ)^{T} is the transpose of (x − μ). To simplify Equation (3), θ is introduced to represent the parameters μ and Σ; the probability density function of the d-dimensional Gaussian distribution can then be written as:

p(x \mid \theta) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right), \quad \theta = \{\mu, \Sigma\}  (4)
According to the law of large numbers, every pixel is one sample of the real scene. When the resolution is high enough, the average color density could be estimated.
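As an illustration of Equations (3) and (4), the sketch below evaluates the d-dimensional Gaussian density directly with NumPy; the mean and covariance values are illustrative only.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a d-dimensional Gaussian N(mu, sigma) at point x (Equation (3))."""
    d = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

# Example: density of an RGB pixel under a Gaussian color model.
mu = np.array([120.0, 90.0, 70.0])      # mean R, G, B values (illustrative)
sigma = np.diag([400.0, 400.0, 400.0])  # diagonal covariance (illustrative)
print(gaussian_pdf(np.array([125.0, 95.0, 68.0]), mu, sigma))
```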
2.2. Gaussian Mixture Model of RGB Image
In reality, the color distributions of the gesture image in Figure 2 can be represented by three histograms [21], shown in Figure 3. From the independent red, green and blue distributions in Figure 3, we can see that the gesture image cannot be described exactly by one single Gaussian model. However, there are about five peaks in each histogram, so five single Gaussian models should be combined in modelling the gesture image.
The GMM is introduced to approximate the continuous probability distribution by superposing a number of single Gaussian models. The probability density function of a GMM with k mixed Gaussian components is:

p(x \mid \Theta) = \sum_{i=1}^{k} \pi_i \, p(x \mid \theta_i)  (5)

where i indicates which single Gaussian model a component belongs to. π_i is the mixing coefficient of the i-th mixed component [22], or the prior probability of x belonging to the i-th single Gaussian model, with \sum_{i=1}^{k} \pi_i = 1. p(x | θ_i) is the probability density function of the i-th single Gaussian model, parameterized by μ_i and Σ_i in θ_i. Θ is introduced as the parameter set [23], Θ = {π_1, ..., π_k, θ_1, ..., θ_k}, to denote all the mixing coefficients and component parameters.
As mentioned above, one RGB hand gesture image can be described by the dataset X = {x_1, x_2, ..., x_{MN}}, and if we regard X as a sample, its probability density is:

P(X \mid \Theta) = \prod_{j=1}^{MN} p(x_j \mid \Theta) = L(\Theta \mid X)  (6)

where L(Θ | X) is called the likelihood function of the parameters given the sample X. We then hope to find a set of parameters Θ* that completes the modelling. According to the maximum likelihood method [24], our next task is to find Θ*, where:

\Theta^{*} = \arg\max_{\Theta} L(\Theta \mid X)  (7)

The functions P(X | Θ) and L(Θ | X) have the same equation form, but since we are now using X to estimate Θ, Θ becomes the variable and X the fixed parameter, so the second form is used. The value of L(Θ | X), a product of many small densities, is usually too small to be represented by a computer, so we replace it with the log-likelihood function [25]:

\ell(\Theta) = \log L(\Theta \mid X) = \sum_{j=1}^{MN} \log \sum_{i=1}^{k} \pi_i \, p(x_j \mid \theta_i)  (8)
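The following sketch shows a numerically stable way to evaluate the log-likelihood of Equation (8); the use of SciPy's logsumexp is our implementation choice, not part of the original method.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    """Log-likelihood of pixel dataset X under a k-component GMM (Equation (8))."""
    k = len(pis)
    # log of each weighted component density, shape (n_pixels, k)
    log_terms = np.column_stack([
        np.log(pis[i]) + multivariate_normal.logpdf(X, mus[i], sigmas[i])
        for i in range(k)
    ])
    # logsumexp avoids the underflow a plain product of densities would cause
    return logsumexp(log_terms, axis=1).sum()
```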
2.3. Expectation Maximization Algorithm
After establishing the Gaussian mixture model of an RGB hand gesture image, several parameters still need to be estimated. The expectation maximization (EM) algorithm [26] is introduced for the subsequent calculations. The EM algorithm is a method of acquiring the parameter set Θ* in the maximum likelihood method. There are two steps in this algorithm, called the E-step and the M-step, respectively. To start the E-step, we introduce another probability Q_i(x_j). It is the posterior probability of π_i, in other words, the posterior probability that each x_j from the dataset X belongs to the i-th single Gaussian model:

Q_i(x_j) = \frac{\pi_i \, p(x_j \mid \theta_i)}{\sum_{l=1}^{k} \pi_l \, p(x_j \mid \theta_l)}  (9)

where the definition of Q_i(x_j) follows from Bayes' theorem, and \sum_{i=1}^{k} Q_i(x_j) = 1. We then use Equation (9) to rewrite the log-likelihood function in (8):

\ell(\Theta) = \sum_{j=1}^{MN} \log \sum_{i=1}^{k} Q_i(x_j)\,\frac{\pi_i \, p(x_j \mid \theta_i)}{Q_i(x_j)}  (10)

\ell(\Theta) \geq \sum_{j=1}^{MN} \sum_{i=1}^{k} Q_i(x_j) \log \frac{\pi_i \, p(x_j \mid \theta_i)}{Q_i(x_j)}  (11)

From (10) to (11), Jensen's inequality has been applied, since log(·) is concave on its domain. Maximizing the lower bound in Equation (11) guarantees that \ell(\Theta) does not decrease. The EM iteration, which estimates the new parameters in terms of the old ones, proceeds as follows:
Initialization: Initialize the means μ_i with random numbers [27], use unit matrices as the covariance matrices Σ_i to start the first iteration, and assume uniform mixing coefficients, or prior probabilities, π_i = 1/k.
E-step: Compute the posterior probabilities Q_i(x_j) using the current parameters, according to Equation (9).
M-step: Renew the parameters:

\pi_i^{new} = \frac{1}{MN} \sum_{j=1}^{MN} Q_i(x_j)  (12)

\mu_i^{new} = \frac{\sum_{j=1}^{MN} Q_i(x_j)\, x_j}{\sum_{j=1}^{MN} Q_i(x_j)}  (13)

\Sigma_i^{new} = \frac{\sum_{j=1}^{MN} Q_i(x_j)\,(x_j - \mu_i^{new})(x_j - \mu_i^{new})^{T}}{\sum_{j=1}^{MN} Q_i(x_j)}  (14)
For most hand gesture images, the number of iterations is fixed in advance. To balance segmentation quality against efficiency, the number of iterations is set to 8 [28].
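A minimal EM sketch for the pixel dataset, following Equations (9) and (12)-(14); the broad initial covariances replace the paper's unit matrices purely for numerical safety with 0-255 pixel values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k=5, n_iter=8, seed=0):
    """Minimal EM for a k-component GMM on an (n, d) pixel dataset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, size=k, replace=False)].copy()  # random initial means [27]
    # The paper initializes with unit covariance matrices; broader ones are
    # numerically safer for 0-255 pixel values (an implementation choice).
    sigmas = np.array([np.eye(d) * 1000.0 for _ in range(k)])
    pis = np.full(k, 1.0 / k)                              # uniform priors

    for _ in range(n_iter):                                # 8 iterations in the paper
        # E-step: posteriors Q_i(x_j), Equation (9)
        dens = np.column_stack([
            pis[i] * multivariate_normal.pdf(X, mus[i], sigmas[i])
            for i in range(k)
        ])
        Q = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)

        # M-step: update priors, means and covariances, Equations (12)-(14)
        Nk = Q.sum(axis=0)
        pis = Nk / n
        mus = (Q.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            sigmas[i] = (Q[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return pis, mus, sigmas
```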
3. Interactive Image Segmentation
The modelling method discussed previously provides a universal way of dealing with hand gesture images. To segment the digital images, a mask is introduced as shown in Figure 4; it is a binary bitmap denoted as α. By introducing it, we turn the segmentation problem into a pixel labelling problem. As α_j ∈ {0, 1}, the value 0 labels background pixels and 1 labels foreground pixels.
To deal with the GMM tractably, we introduce two independent k-component GMMs, one for the foreground modelling and one for the background modelling. Each pixel xj, either from the background or the foreground model, is marked as αj = 1 or 0. The parameters of each component become: θi = {πi(αj), μi(αj), Σi(αj); αj = 0,1, i = 1, …, k}.
3.1. Gibbs Random Field
The overall color modelling completes the first step of our human visual system analogy. To take every detail of the image into account, the Gibbs random field (GRF) [29] is introduced. The GRF is defined as:

P(A = a) = \frac{1}{Z} \exp\left(-\frac{E(a)}{T}\right)  (15)

Here, P(A = a) gives the probability of the system A being in state a. T is a constant parameter, whose physical unit is temperature, and its value is usually 1. Z is the partition function:

Z = \sum_{a} \exp\left(-\frac{E(a)}{T}\right)  (16)

where E(a) is interpreted as the energy function of the state a. To apply the GRF in image segmentation, the Gibbs energy [30] is defined as follows:

E(\alpha, \Theta, X) = U(\alpha, \Theta, X) + V(\alpha, X)  (17)
The term U(α, Θ, X), also called the regional term, is defined on the basis of the GMM. It indicates the penalty of a pixel x_j being classified as background or foreground:

U(\alpha, \Theta, X) = \sum_{j} -\log p(x_j \mid \alpha_j, \Theta)  (18)

V(α, X) is the boundary term, defined to describe the smoothness between a pixel x_m and its neighbour pixels x_n in the pixel set N:

V(\alpha, X) = \gamma \sum_{(m,n) \in N} [\alpha_m \neq \alpha_n] \exp\left(-\beta \lVert x_m - x_n \rVert^2\right)  (19)

where the constant γ was set to 50 by optimizing performance over training data. [·] is an indicator function taking the value 0 or 1 according to the condition inside. β is a constant that represents the contrast of the pixel set N and adjusts the exponential term; ⟨·⟩ in the equation below denotes the expectation:

\beta = \frac{1}{2 \left\langle \lVert x_m - x_n \rVert^2 \right\rangle}  (20)
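A sketch of how β (Equation (20)) and the pairwise weights of Equation (19) can be computed; the restriction to right and downward 4-connected neighbours is an implementation assumption.

```python
import numpy as np

def boundary_weights(img, gamma=50.0):
    """Pairwise weights of Equation (19) for right and downward 4-neighbours."""
    img = img.astype(np.float64)
    # squared color differences to the right and downward neighbours
    dr = ((img[:, 1:] - img[:, :-1]) ** 2).sum(axis=2)
    dd = ((img[1:, :] - img[:-1, :]) ** 2).sum(axis=2)
    # beta = 1 / (2 * <||x_m - x_n||^2>), the expectation over all neighbour pairs
    beta = 1.0 / (2.0 * np.concatenate([dr.ravel(), dd.ravel()]).mean())
    w_right = gamma * np.exp(-beta * dr)
    w_down = gamma * np.exp(-beta * dd)
    return w_right, w_down, beta
```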
3.2. Automatic Seed Selection
Up to now, all the constants have been defined. To begin with, all the pixels in the picture are automatically marked as undefined and labelled U [31]. B is the background seed pixel set and O is the foreground seed set. After training over the pixel set X, the set O is obtained as the segmentation result, and the remaining pixels fall into B. The three pixel sets are shown in Figure 5.
To achieve the segmentation automatically, we propose an initial seed selection method for hand gesture images. Considering that human skin color has an elliptical distribution in the YCbCr color space [32], the image is transformed from the RGB color space to YCbCr using the equation below:

\begin{pmatrix} Y \\ C_b \\ C_r \end{pmatrix} = \begin{pmatrix} 0 \\ 128 \\ 128 \end{pmatrix} + \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}  (21)

where Y indicates the luminance. By placing an upper limit on Y, the interference of highlights can be overcome. The C_b, C_r values of human skin color are then located by the elliptical equations given below:

\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} C_b - c_x \\ C_r - c_y \end{pmatrix}  (22)

\frac{(x - ec_x)^2}{a^2} + \frac{(y - ec_y)^2}{b^2} \leq 1  (23)

where x and y are intermediate variables. All the pixels satisfying the equations above are marked as foreground seeds, which belong to set O. We also define the pixels on the image edges as background seeds, which belong to set B, because the gestures are usually located far away from the edges of the images. The results of the seed selection are displayed in Figure 6.
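A sketch of the automatic seed selection. The ellipse constants are the commonly cited values from Hsu et al.'s skin model, which reference [32] points to; the paper's exact constants may differ. Note that OpenCV stores the channels in Y, Cr, Cb order.

```python
import cv2
import numpy as np

# Elliptical skin-model constants as published by Hsu et al. [32];
# the paper's exact values may differ slightly.
CX, CY, THETA = 109.38, 152.02, 2.53
ECX, ECY, A, B = 1.60, 2.41, 25.39, 14.03

def seed_mask(img_bgr):
    """Seed map: 1 = foreground seeds (set O), 2 = background seeds (set B), 0 = U."""
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]   # OpenCV channel order is Y, Cr, Cb
    ct, st = np.cos(THETA), np.sin(THETA)
    # rotate (Cb, Cr) into the ellipse-aligned coordinates x, y (Equation (22))
    x = ct * (cb - CX) + st * (cr - CY)
    y = -st * (cb - CX) + ct * (cr - CY)
    inside = ((x - ECX) / A) ** 2 + ((y - ECY) / B) ** 2 <= 1.0  # Equation (23)
    seeds = np.zeros(img_bgr.shape[:2], np.uint8)
    seeds[inside] = 1                       # foreground seeds, set O
    seeds[0, :] = seeds[-1, :] = 2          # image edges as background seeds, set B
    seeds[:, 0] = seeds[:, -1] = 2
    return seeds
```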
3.3. Min-Cut/Max-Flow Algorithm
According to the Gibbs random field, the image segmentation, or pixel labelling, problem is equivalent to minimizing the Gibbs energy function:

\alpha^{*} = \arg\min_{\alpha} E(\alpha, \Theta, X)  (24)

The min-cut/max-flow algorithm [33] is applied to finish the segmentation accurately. The idea of this algorithm is to regard the image as a network of nodes, where each node takes the place of the corresponding pixel. Apart from these, two extra nodes, S and T, are introduced, representing the "source" and the "sink", respectively. Node S is linked to the pixels belonging to O, while T is linked to the pixels in B, as shown in Figure 7.
There are three kinds of links in the neighbourhood N: from pixel to pixel (n-links), and from pixel to S or from pixel to T (t-links). Each link is assigned a certain weight, or cost [34], incurred when it is cut, as detailed in Table 1. According to the max-flow/min-cut theorem, an optimal segmentation is defined by the minimum cut C, as seen in Figure 7c. C is a set of links whose total cost is minimal:

C^{*} = \arg\min_{C} \sum_{e \in C} w(e)  (25)
Then the Gibbs energy can be minimized by using the min-cut defined above. The whole segmentation process is as follows: firstly, assign a GMM component i to each pixel according to the initial labelling of the U region. Secondly, learn the parameter set Θ from the whole pixel set X. Thirdly, use the min-cut to minimize the Gibbs energy of the whole image. Then jump back to the first step to start another round; after eight rounds, the optimal segmentation is achieved.
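A sketch of one min-cut round using the PyMaxflow library (our choice; the paper does not name an implementation). The arrays `unary_fg`/`unary_bg` would hold the regional penalties of Equation (18) and `w_right`/`w_down` the boundary weights of Equation (19); the assignment of penalties to the S and T links follows one common convention and should be checked against Table 1.

```python
import numpy as np
import maxflow  # PyMaxflow library (an assumed implementation choice)

def mincut_segment(unary_fg, unary_bg, w_right, w_down):
    """One min-cut round over a 4-connected pixel grid.

    unary_fg/unary_bg: regional penalties of Equation (18);
    w_right/w_down:    boundary weights of Equation (19).
    Returns alpha, a binary map with 1 for foreground pixels.
    """
    M, N = unary_fg.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((M, N))
    for i in range(M):
        for j in range(N):
            # t-links: one common convention puts the background penalty on the
            # S-link and the foreground penalty on the T-link (cf. Table 1)
            g.add_tedge(nodes[i, j], unary_bg[i, j], unary_fg[i, j])
            # n-links to the right and downward neighbours
            if j + 1 < N:
                g.add_edge(nodes[i, j], nodes[i, j + 1],
                           w_right[i, j], w_right[i, j])
            if i + 1 < M:
                g.add_edge(nodes[i, j], nodes[i + 1, j],
                           w_down[i, j], w_down[i, j])
    g.maxflow()
    # get_grid_segments returns True for nodes on the sink side of the cut;
    # under the convention above, the source side is the foreground
    alpha = (~g.get_grid_segments(nodes)).astype(np.uint8)
    return alpha
```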
4. Experimental Comparison
To evaluate the interactive segmentation quantitatively, we chose an image dataset proposed by Gulshan et al. [13], which contains 49 images from the GrabCut dataset [35], 99 images from the PASCAL VOC'09 segmentation challenge [36] and 3 images from the alpha-matting dataset [37]. These images cover a wide variety of shapes, textures and backgrounds. The corresponding ground-truth images, together with the initial seeds, are also included in this dataset. The initial seed maps are made up of four manually generated brush strokes, each 8 pixels wide: one for the foreground and three for the background, as shown in Figure 8.
To simulate human interactions, after the first segmentation with the initial seed map, one more seed is generated automatically in the largest connected segmentation error area (LEA). As shown in Figure 9a, the blue area is the segmentation result of the algorithm, the white area is the ground-truth segmentation, and the LEA is marked in yellow. As shown in Figure 9b, the seed is a round dot (8 pixels in diameter) generated at the LEA. The segmentation is then updated with all the seeds. This step is repeated 20 times, producing a sequence of segmentations.
To evaluate the quality of the segmentation results, we use two different measures: region accuracy (RA) and boundary accuracy (BA). Each evaluation is conducted on a single segmentation, and all the images in Gulshan's dataset are tested to verify that our proposed method is suitable for interactive image segmentation.
4.1. Region Accuracy
The RA of the segmentation results is evaluated by a weighted F_β measure [38]. Compared with the normal F_β measure, the two terms Precision and Recall become:

Precision^{w} = \frac{TP^{w}}{TP^{w} + FP^{w}}, \quad Recall^{w} = \frac{TP^{w}}{TP^{w} + FN^{w}}  (26)

where TP denotes the overlap between the ground-truth and segmented foreground pixels, FP denotes the background pixels wrongly segmented as foreground compared with the ground-truth image, FN denotes the foreground pixels wrongly segmented as background, and the superscript w indicates the pixel weighting of [38].
The F_β^{w} measure is defined as follows:

F_{\beta}^{w} = \frac{(1 + \beta^2)\, Precision^{w} \cdot Recall^{w}}{\beta^2 \cdot Precision^{w} + Recall^{w}}  (27)

where β signifies the effectiveness of detection with respect to a user who attaches β times as much importance to Recall^{w} as to Precision^{w}; normally β = 1. We then apply F_β^{w} to calculate the RA of the different segmentation results. The higher the RA, the better the segmentation.
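For reference, a sketch of the unweighted F_β computation that Equation (27) generalises; the weighted variant additionally applies the pixel weighting of [38].

```python
import numpy as np

def region_accuracy(seg, gt, beta=1.0):
    """Plain F_beta score of a binary segmentation against the ground truth.

    The paper uses the weighted F_beta measure of [38]; this sketch shows the
    unweighted form that it generalises.
    """
    seg, gt = seg.astype(bool), gt.astype(bool)
    tp = np.logical_and(seg, gt).sum()    # correctly segmented foreground
    fp = np.logical_and(seg, ~gt).sum()   # background wrongly marked foreground
    fn = np.logical_and(~seg, gt).sum()   # foreground wrongly marked background
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-12)
```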
4.2. Boundary Accuracy
The BA [39] is defined according to the Hausdorff distance. The boundary pixel sets of the ground-truth image and the segmented image are denoted B_GT and B_SEG, as shown in Figure 10. The formula is as follows:

BA = \frac{\sum_{g \in B_{GT}} \min_{s \in B_{SEG}} dist(g, s) + \sum_{s \in B_{SEG}} \min_{g \in B_{GT}} dist(g, s)}{N(B_{GT}) + N(B_{SEG})}  (28)

where g ∈ B_GT and s ∈ B_SEG, dist(·) denotes the Euclidean distance, and N(·) is the number of pixels in a set. The value of BA reflects how closely the segmented boundary follows the ground-truth boundary.
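A sketch of the BA computation as reconstructed in Equation (28), using SciPy; extracting boundaries by morphological erosion is an implementation assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree

def boundary_pixels(mask):
    """Boundary pixels = mask minus its erosion."""
    mask = mask.astype(bool)
    return np.argwhere(mask & ~binary_erosion(mask))

def boundary_accuracy(seg, gt):
    """Symmetric mean boundary distance between B_SEG and B_GT (Equation (28))."""
    b_seg, b_gt = boundary_pixels(seg), boundary_pixels(gt)
    d_gt = cKDTree(b_seg).query(b_gt)[0]   # each ground-truth boundary pixel to B_SEG
    d_seg = cKDTree(b_gt).query(b_seg)[0]  # each segmented boundary pixel to B_GT
    return (d_gt.sum() + d_seg.sum()) / (len(b_gt) + len(b_seg))
```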
4.3. Results Analysis
We segmented the images from the dataset by graph cut and random walker, as shown in Figure 11. The segmentation test of our method was performed on Gulshan's dataset as well as on our hand gesture images, and some of the results of our method on hand gesture image segmentation are shown in Figure 12.
For a more rigorous test, we tested 151 images from Gulshan's dataset and used the human interaction simulator to perform the interactions, generating seeds 20 times to further refine the segmentation results. The result of each simulation step was tested on the experimental platform. The RA and BA scores, averaged over the 151 segmentations, are shown in Figure 13 and Figure 14.
As the figures show, the segmentation quality increases with the number of simulated human interactions, and a satisfactory segmentation is achieved once the seed number becomes high. Our method obtains the best segmentation quality with few human interactions. Since the seeds are generated automatically in hand gesture image segmentation, our method is well suited to this task.
5. Hand Gesture Recognition
We defined five hand gestures: hand closed (HC), hand open (HO), wrist extension (WE), wrist flexion (WF) and fine pinch (FP), as shown in Figure 15.
One hundred images of each hand gesture were captured and segmented by the proposed method. We used the recognition framework shown in Figure 16. Each gesture contributes 50 images for training and 50 for testing. To achieve a better classification, we extract HOG features along with Hu invariant moments, weighted equally. The K-SVD dictionary training method [40] is used to choose atoms representing [41] all the features and to reduce the computational cost.
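A sketch of the feature extraction step, assuming scikit-image's HOG and OpenCV's Hu moments; the resize target, the HOG cell sizes and the log-scaling of the Hu moments are illustrative assumptions.

```python
import cv2
import numpy as np
from skimage.feature import hog

def gesture_features(gray):
    """Concatenate HOG and Hu-moment features for one segmented gesture image."""
    gray = cv2.resize(gray, (64, 64))          # illustrative resize target
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    hu = cv2.HuMoments(cv2.moments(gray)).ravel()
    # log-scale the Hu moments, which span many orders of magnitude
    hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
    return np.concatenate([hog_vec, hu])
```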
We tested the recognition rates on both unsegmented and segmented hand images. The recognition rates on unsegmented hand images are shown in Table 2, and the recognition rates on segmented hand images are shown in Table 3.
According to the results in the tables above, segmenting the images before feature extraction increases the recognition rates for all five hand gestures compared with unsegmented images.
6. Conclusions and Future Work
In conclusion, the proposed interactive hand gesture image segmentation method meets the segmentation demands of hand gesture images without requiring human interaction. The mechanism behind the method has been carefully explored and derived with the assistance of modern mathematical theories. Compared with other popular image segmentation methods on hand gesture images, our method obtains better segmentation accuracy and higher quality when only limited seeds are available. Automatic seed selection also helps to reduce human interactions, and the segmentation work in turn improves the recognition rate. In future work, we could adapt this method to higher-resolution pictures, which requires simplifying the calculation process. In seed selection, the automatic selection method could be improved to overcome various interferences, such as highlights, shadows and image distortion. Further work will focus on improving the recognition rate by integrating the segmentation algorithm with more advanced recognition methods.