Dynamic Gesture Recognition System with Gesture Spotting Based on Self-Organizing Maps

Abstract: In this paper, a real-time dynamic hand gesture recognition system with a gesture spotting function is proposed. In the proposed system, input video frames are converted to feature vectors, which are used to form a posture sequence vector that represents the input gesture. Then, gesture identification and gesture spotting are carried out in the self-organizing map (SOM)-Hebb classifier. The gesture spotting function detects the end of the gesture by using the vector distance between the posture sequence vector and the winner neuron's weight vector. The proposed gesture recognition method was tested by simulation and by a real-time gesture recognition experiment. The results revealed that the system could recognize nine types of gesture with an accuracy of 96.6%, and that it successfully output the recognition result at the end of each gesture using the spotting result.


Introduction
Hand gestures are one of the most important communication tools in our daily lives, and they are an attractive means of human-computer interaction (HCI). Hand gestures are generally either static hand signs or dynamic hand gestures. Hand signs are static hand poses without any movement, whereas a dynamic hand gesture is a movement made up of a sequence of hand poses. Thus, a hand sign recognition system identifies the meaning of a hand pose. In dynamic gesture recognition, by contrast, each gesture is defined as the trajectory of the hand movement or as a sequence of hand poses.
A number of video-based hand gesture recognition algorithms and systems have been proposed [1]. This approach can use the conventional camera that most laptop PCs are equipped with. Thus, a video-based gesture recognition system can easily be implemented on widely available platforms. Another approach is based on three-dimensional hand images, which has attracted researchers in gesture recognition because the use of 3D images can improve performance [2]. However, 3D gesture recognition requires a special device such as the Microsoft Kinect or the Leap Motion. A gesture recognition system should work in real time for practical use. One of the important functions required for a real-time dynamic gesture recognition system is gesture spotting. Gesture spotting segments a meaningful portion from a continuous data stream by finding the start and end of a gesture. The simplest way to provide gesture spotting is to define a key posture that indicates the start and end of a gesture. However, this approach disturbs the natural flow of the intended sequence of gestures. Thus, a new approach that can detect the start and end of a gesture naturally in a continuous sequence of hand motion is desired.
In our previous work, a hardware hand sign recognition system was proposed, which was a video-based system that recognized static hand signs [3]. Its recognition algorithm

Related Work
In gesture recognition, hand segmentation is carried out first to detect the hand position or hand shape. A popular segmentation method in vision-based systems is skin color detection, which extracts the hand portion from a cluttered background [5,6]. Yun et al. [7] proposed a multi-feature fusion method that improved recognition results by extracting angle count, skin color angle, and non-skin color angle in combination with Hu invariant moment features. Some gesture recognition systems simplified hand extraction from the background with the help of inexpensive color-coded gloves. A glove color-coded with six unique colors was used in [8,9]. Wang and Popović [10] employed an ordinary cloth glove printed with a custom pattern that was designed to estimate the poses. Our previous work [3] also employed a two-colored glove for hand segmentation. Another option for gesture segmentation is the use of 3D images taken through depth sensors, such as the Microsoft Kinect depth camera and the Leap Motion. The 3D camera views the subject in the front plane and generates a depth image of the subject, and the depth image is used for background removal, followed by the generation of the depth profile of the subject. Gesture recognition systems with the Kinect are found in [11][12][13][14][15]. Molina et al. [16] used another depth camera, a time-of-flight range camera, that supplied real-time depth information per pixel. In terms of applicability, vision-based gesture segmentation is desirable since it requires only a conventional camera available on most laptop PCs, and no special depth sensor is needed.
Unlike the Kinect sensor and other depth sensors, the output of the Leap Motion is depth data that consists of the palm direction, fingertip positions, palm center position, and other relevant points. Therefore, no extra computational work is needed to obtain this information. Due to these unique features, the Leap Motion has been applied to dynamic hand gesture recognition by several researchers. Lu et al. [17] proposed a dynamic gesture recognition system in which the Leap Motion was used to compute the feature vector of the gesture, and a hidden conditional neural field (HCNF) classifier was used to recognize the dynamic hand gesture. Another example is the work done by H. Li et al. [18]. Their hand gesture recognition system was based on the Leap Motion and spatial fuzzy matching (SFM). Hand-eye coordination means the ability to combine seeing and hand movement. Ujbanyi et al. [19,20] examined the correlations between eye motion and the motion of the mouse cursor regarding hand-eye coordination, and they used a hand-eye tracking system made of the Leap Motion and the Eye Tribe tracker.
A challenge of real-time dynamic gesture recognition is gesture spotting, or temporal segmentation, which detects when the gesture starts and ends. In the system proposed by Varshini et al. [13], each dynamic gesture was defined as a sequence of trigger poses, and the start and end of the gesture were detected by finding the start and end triggers. Chai et al. [21] used hand positions to perform the temporal segmentation by assuming that a user makes a hands-up pose at the start of a gesture and a hands-down pose at the end of the gesture. A real-time dynamic hand gesture recognition system proposed by Chen et al. [15] used two hand configurations (open hand, closed hand) to achieve gesture spotting, and the 3D motion trajectory of the dynamic gesture was captured by the Kinect sensor. These approaches disturb the natural flow of the gesture, and thus a new approach that can detect the start and end of a gesture naturally is desirable.
Static hand gesture recognition can be achieved by applying standard pattern recognition techniques such as template matching, whereas dynamic gesture recognition requires a time-series pattern recognition algorithm such as a hidden Markov model (HMM) or the dynamic time warping (DTW) algorithm. The HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process. The HMM is a doubly stochastic process with an underlying stochastic process that is not observable, but that can be observed through another set of stochastic processes that produce a sequence of observed symbols. The model is known for its applications to various fields, including gesture recognition [22]. A problem of gesture recognition with the HMM is that its recognition accuracy decreases if the behavior during the gesture transition has not been precisely trained. The DTW is an algorithm for measuring the similarity between two temporal sequences that may vary in speed. Plouffe et al. [14] and Molina et al. [16] employed the DTW algorithm for their dynamic gesture recognition systems.
Another popular recognition algorithm is a neural network and its derivatives, especially deep learning methodologies [23]. Most modern deep learning models are based on convolutional neural networks (CNNs). The CNNs have been well studied and applied to fields of image recognition. The most crucial challenge in deep learning based gesture recognition is the handling of the temporal dimension. One approach uses 3D filters in the convolutional layer of the CNN. The 3D-CNN captures features of both spatial and temporal dimensions while maintaining a certain temporal structure. Another approach combines a temporal sequence modeling with a 2D (or 3D) CNN. One of the most used networks for the temporal modeling is a recurrent neural network (RNN), which can take into account the temporal data using recurrent connections in hidden layers. The drawback of this network is its short-term memory, and long short-term memory (LSTM) was proposed to solve the problem.
Molchanov et al. [24] proposed a recurrent 3D-CNN that performed simultaneous detection and classification of dynamic hand gestures from multi-modal data. Wu et al. [25] employed a novel method called deep dynamic neural networks (DDNN) for multimodal gesture recognition. A multimodal gesture recognition method based on a 3D convolutional LSTM network was proposed by Zhu et al. [26]. Naguri [27] proposed a gesture recognition system based on an LSTM and a CNN that were trained to process input sequences of 3D hand positions and velocities. Chai et al. [21] proposed a continuous gesture recognition method with a two-stream RNN (2S-RNN) for RGB-depth image recognition. John et al. [28] proposed a vision-based gesture recognition system for an automotive user interface, and they employed a long-term recurrent convolutional network to classify the video sequence of the dynamic hand gesture.
A recognition system proposed by Chen et al. [15] employed a support vector machine (SVM) as the recognition algorithm. Kim et al. [29] proposed a novel method to measure video-to-video volume similarity by extending canonical correlation analysis (CCA). The proposed matching method was then demonstrated for action classification with a simple nearest neighbor classifier. The Jordan recurrent neural network (JRNN) is a class of recurrent neural networks, which is a three-layer network with the addition of a set of context units [30]. The context units are fed from the output layer, and they have a recurrent connection to themselves. This allows the JRNN to exhibit temporal dynamic behavior, so it can be applied to gesture recognition. Araga et al. [31] employed the JRNN to implement their dynamic gesture recognition system.

Gesture Recognition System
Appl. Sci. 2021, 11, 1933 4 of 14
Figure 1 outlines the flow of the gesture recognition algorithm. The proposed system consists of a feature vector generator, a sequence vector generator, and the SOM-Hebb classifier. The input to the system is a stream of video frames, and a dynamic hand gesture is assumed to consist of a sequence of F video frames. Since the frames contain different postures, the dynamic gesture can be classified by examining how the posture changes over the F consecutive video frames.

Feature Vector Generation
In the feature vector generator, each image frame of P × Q pixels in RGB color format is converted to the feature vector $\vec{G}$. The feature vector proposed in [3] is employed. The computation, shown in Figure 2, consists of a binary quantization, horizontal and vertical projection histogram calculations, and two discrete Fourier transforms (DFTs). The output is the D-dimensional feature vector $\vec{G}$.
First, the input color frame is converted to a binary image I(x, y). So that the system can remove the background, including the arm, and extract the finger segments, the user is required to wear a glove whose finger portion is colored red. A pixel is treated as 1 if its color is red, and as 0 otherwise. Then the horizontal and vertical projection histograms $P_H(y)$ and $P_V(x)$ of I(x, y) are calculated by summing I(x, y) along each row and along each column, respectively. After the histogram calculations, DFTs are carried out on the histograms.
Here, $A_H(k)$, $A_V(k)$ and $B_H(k)$, $B_V(k)$ are the real and imaginary parts of the frequency components of the histograms. Then, $F_H(k)$ and $F_V(k)$, i.e., the magnitude spectra of $P_H(y)$ and $P_V(x)$, are computed from these real and imaginary parts. The $F_H(k)$ and $F_V(k)$ of images showing the same hand posture at different positions are identical, because the magnitude spectra lack the phase information that encodes the position of the hand. Since most of the feature information of the image is concentrated in the lower frequency components, these components are used as the feature vector. The resulting D-dimensional feature vector $\vec{G}$ is fed to the sequence vector generator.
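The feature vector computation described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation of [3]: the red-pixel test `is_red`, the normalization of the DFT by the histogram length, the equal split of D between the horizontal and vertical spectra, and the helper `blob_frame` are all assumptions introduced here.

```python
import cmath

def binarize(frame, is_glove_color):
    """Binary quantization: 1 where the pixel has the glove color, else 0."""
    return [[1 if is_glove_color(px) else 0 for px in row] for row in frame]

def projection_histograms(binary):
    """Return (P_H, P_V): row sums indexed by y and column sums indexed by x."""
    p_h = [sum(row) for row in binary]
    p_v = [sum(col) for col in zip(*binary)]
    return p_h, p_v

def dft_magnitudes(hist, count):
    """Magnitudes of the `count` lowest DFT bins (normalization is assumed)."""
    n = len(hist)
    mags = []
    for k in range(count):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, v in enumerate(hist))
        mags.append(abs(coeff) / n)
    return mags

def feature_vector(frame, is_glove_color, d):
    """Concatenate low-frequency magnitudes of both histograms (assumed layout)."""
    p_h, p_v = projection_histograms(binarize(frame, is_glove_color))
    half = d // 2
    return dft_magnitudes(p_h, half) + dft_magnitudes(p_v, d - half)

def is_red(px):
    """Hypothetical threshold test for a red glove pixel in RGB."""
    r, g, b = px
    return r > 150 and g < 100 and b < 100

def blob_frame(top, left, size=8):
    """Hypothetical test frame: black background with a 2 x 3 red blob."""
    red, black = (255, 0, 0), (0, 0, 0)
    return [[red if top <= y < top + 2 and left <= x < left + 3 else black
             for x in range(size)] for y in range(size)]

# The same blob at two positions yields the same magnitude-based features.
g_a = feature_vector(blob_frame(1, 1), is_red, d=8)
g_b = feature_vector(blob_frame(4, 3), is_red, d=8)
```

Comparing `g_a` and `g_b` illustrates the position invariance discussed above: shifting the blob changes only the phase of the DFT, so the magnitudes agree up to rounding error.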

Sequence Vector Generator
The SOM-1 in the sequence vector generator quantizes the input vectors, and the quantization results are stored sequentially in shift registers. The contents of the shift registers form the sequence vector, which represents the temporal change of the input posture and is fed to the subsequent SOM-Hebb classifier. The SOM-1 includes $M_1 \times M_1$ neurons, and each neuron j holds a D-dimensional weight vector $\vec{m}_j$, where j is the neuron number.
Operation of the SOM is divided into learning and recall phases. The weight vectors of the neurons are trained with a set of input vectors in the learning phase. The learning phase is made of a winner search and weight update. During the recall phase, only the winner search is carried out by using the map of the trained weight vectors.
The winner neuron is the neuron whose weight vector is nearest to the input vector. For the winner search, the Euclidean distance $V_1(j) = \lVert \vec{G} - \vec{m}_j \rVert$ between the input vector and the weight vector of neuron j is calculated.
The winner neuron c is then determined as the neuron with the smallest distance, $c = \arg\min_j V_1(j)$.
In the weight update, the weight vectors of the winner and its neighboring neurons are moved closer to the input vector. The size of each update is governed by the neighborhood function h(c, j, t), where t is the time index. This function depends on the learning coefficient α(t), with 0 < α(t) < 1, on the coordinate vectors $\vec{r}_c$ and $\vec{r}_j$ of the winner neuron c and of neuron j, respectively, and on the neighborhood radius σ(t). The weight vectors within this radius of the winner neuron are updated.
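The winner search and weight update can be sketched as follows. This is a minimal sketch under stated assumptions: the step-shaped ("bubble") neighborhood function, which returns α inside the radius σ and zero outside, is chosen here only because it matches the description that weight vectors within the radius are updated; a Gaussian neighborhood is a common alternative. The function names and the dictionary-based map are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_winner(weights, x):
    """Return (coordinates of the nearest neuron, its distance to x).

    weights maps neuron coordinates (w_x, w_y) to weight vectors.
    """
    c = min(weights, key=lambda j: euclidean(weights[j], x))
    return c, euclidean(weights[c], x)

def neighborhood(c, j, alpha, sigma):
    """Assumed bubble neighborhood: alpha within radius sigma of c, else 0."""
    grid_distance = math.hypot(c[0] - j[0], c[1] - j[1])
    return alpha if grid_distance <= sigma else 0.0

def train_step(weights, x, alpha, sigma):
    """One learning step: winner search followed by the weight update."""
    c, _ = find_winner(weights, x)
    for j in list(weights):
        h = neighborhood(c, j, alpha, sigma)
        weights[j] = [m + h * (xi - m) for m, xi in zip(weights[j], x)]
    return c

# Tiny 2 x 2 map with 2-dimensional weight vectors.
som = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
       (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = train_step(som, [0.9, 0.8], alpha=0.5, sigma=1.0)
```

In the recall phase only `find_winner` would be called, and the weights would stay fixed.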
After the learning phase, all weight vectors are kept unchanged and the weight map is used in the recall phase. The winner neuron represents the cluster to which the input vector belongs, and the coordinates $(w_X, w_Y)$ of the winner neuron for the input vector are treated as the quantization result. These coordinates are stored sequentially in the shift registers, so their contents represent the sequence of the input video frames. In this paper, this vector is called the sequence vector, $\vec{X}$, which is a 2F-dimensional vector whose elements $\xi_m$ are the stored winner coordinates.
Figure 3 shows an example that explains the operation of the system. As shown in Figure 3A, a gesture is made of 10 posture images in different video frames. In the example, SOM-1 is composed of 8 × 8 neurons, and Figure 3B shows the transition of the winner neuron with respect to the input video frames. The posture in the first video frame (f = 0) makes the neuron at $(w_X, w_Y) = (0, 0)$ the winner. The coordinates of the winner neuron are then stored in the registers of the sequence vector generator, as shown in Figure 3C. The question marks in Figure 3C are the coordinates of the winner neurons of the previous gesture, which are not related to the current gesture. The winner for the second posture (f = 1) is the neuron at (4, 2). The registers are shifted to the right and the new winner coordinates (4, 2) are stored in the leftmost position of the registers. For the third posture at f = 2, $w_X = 6$ and $w_Y = 6$ are loaded into the registers. In this way, the information of the previous gesture in the registers is gradually replaced with that of the current gesture. Therefore, the sequence vector $\vec{X}$ representing the current gesture approaches completion as the video frames progress, and $\vec{X}$ is completed at the 10th frame (f = 9).
The vector $\vec{X}$ is fed to the SOM-Hebb classifier that is described in the next subsection.
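The shift-register behavior in this example can be mimicked with a fixed-length queue. The sketch below is illustrative only: the class name, the use of `None` for the unrelated contents shown as question marks in Figure 3C, and the element order of the flattened vector (newest frame first, $w_X$ before $w_Y$) are assumptions.

```python
from collections import deque

class SequenceVectorGenerator:
    """Holds the winner coordinates of the last n_frames video frames."""

    def __init__(self, n_frames):
        # None stands for coordinates left over from a previous gesture.
        self.registers = deque([(None, None)] * n_frames, maxlen=n_frames)

    def push(self, winner_xy):
        """Shift right and store the new winner in the leftmost position."""
        self.registers.appendleft(tuple(winner_xy))
        return self.vector()

    def vector(self):
        """Flatten the registers into the 2F-dimensional sequence vector."""
        return [coord for xy in self.registers for coord in xy]

# First three frames of the example: winners (0, 0), (4, 2), (6, 6).
generator = SequenceVectorGenerator(n_frames=10)
for winner_xy in [(0, 0), (4, 2), (6, 6)]:
    sequence = generator.push(winner_xy)
```

After the third frame, `sequence` starts with 6, 6, 4, 2, 0, 0, and the remaining 14 elements still hold the stale entries, which is why the vector is complete only after the 10th frame.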

SOM-Hebb Classifier for Sequence Vector Classification
The SOM-Hebb classifier classifies the sequence vector $\vec{X}$ and identifies the input gesture. This classifier is the same one that was proposed in our previous work [3]. The SOM-2 in this classifier consists of $M_2 \times M_2$ neurons, each of which holds a 2F-dimensional weight vector. The SOM-2 is trained in the same way as explained in the previous section. Note that $V_2(c)$ is the vector distance between the winner neuron's weight vector and the input vector, i.e., the sequence vector $\vec{X}$, and that $V_2(c)$ is used to implement the gesture spotting function.
During the recall phase, the class to which the input vector belongs can be identified from the winner neuron of the SOM-2. Here, H represents the number of classes. The Hebb network generates its output $\hat{C}_h$ from the winner neuron. Each neuron represents a single cluster in the input vector space. Since a single gesture class may consist of a combination of multiple clusters, multiple neurons must be associated with the single class in that case.
The neurons belonging to the same class are selected by a single-layer feedforward network. This network is trained by the Hebbian training algorithm, which is a supervised training method. During the Hebb training, training vectors with their class data are sequentially fed to the network. Every training vector makes one of the neurons the winner. If a strong correlation is found between the training vectors in class h and neuron j, then neuron j is assigned to class h. In practice, neuron j is associated with the class of the input vectors for which it won most often.
The SOM-2 must have an appropriate number of neurons for the SOM-Hebb network to work properly. Some neurons may end up with no connection to any gesture class. The selection of such a neuron as the winner in the recognition phase causes false recognition. To avoid this situation, neurons without a connection to any class ID are culled. The culling replaces the weight vectors of these neurons with huge vectors so that they never win.
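The class assignment and the culling can be sketched as a majority vote over winner counts. This is a simplified stand-in for the Hebbian training of [3], following only the practical rule stated above (each neuron takes the class for which it won most often); the function names, the integer neuron IDs, and the value used for the huge vector are assumptions.

```python
from collections import Counter, defaultdict

def label_neurons(winners, labels):
    """Assign each neuron the class of the training vectors it won most often.

    winners[i] is the winner neuron for training vector i,
    labels[i] is the class of training vector i.
    """
    wins = defaultdict(Counter)
    for neuron, label in zip(winners, labels):
        wins[neuron][label] += 1
    return {neuron: counts.most_common(1)[0][0]
            for neuron, counts in wins.items()}

def cull_unlabeled(weights, neuron_class, huge=1e9):
    """Replace weights of neurons without a class so that they never win."""
    for neuron in list(weights):
        if neuron not in neuron_class:
            weights[neuron] = [huge] * len(weights[neuron])
    return weights

# Toy training run: neuron 3 never wins, so it has no class and is culled.
winner_ids = [0, 0, 0, 1, 1, 2]
class_ids = [1, 1, 2, 2, 2, 3]
neuron_class = label_neurons(winner_ids, class_ids)
som2 = cull_unlabeled({0: [0.1, 0.2], 1: [0.5, 0.5],
                       2: [0.9, 0.1], 3: [0.3, 0.7]}, neuron_class)
```

With this toy data, neuron 0 won three times (twice for class 1), so it is assigned to class 1, while neuron 3 receives a huge weight vector.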

Gesture Spotting
An important function required for a dynamic gesture recognition system is gesture spotting, which detects when a gesture ends so that a meaningful gesture is segmented from the sequence of hand motions. The gesture spotting is implemented in the SOM-Hebb classifier by using $V_2(c)$, the vector distance between the winner neuron's weight vector in the SOM-2 and the input vector. The SOM-Hebb classifier performs the recognition for every input frame and generates its recognition result $\hat{C}_h$. However, most of these results are not correct, because the contents of the shift registers do not form the complete vector sequence for the current input gesture until the last gesture frame is input. The recognition result $\hat{C}_h$ is output only when the spotting module detects the end of the gesture.
The end of the gesture is detected by observing the transition of the vector distance $V_2(c)$. Figure 3C shows the transitions in the shift registers, i.e., the development of the posture sequence vector $\vec{X}$. Each gesture consists of 10 frames in this example. Therefore, the shift registers are filled with the appropriate vector elements at the 10th frame (f = 9), and the posture sequence vector $\vec{X}$ is completed, as shown in Figure 3C. The completed vector $\vec{X}$ matches one of the weight vectors in SOM-2, which decreases the vector distance $V_2(c)$ remarkably, as shown in Figure 3D. After that, the distance increases because the vector elements of the next gesture are loaded into the registers. Therefore, the end of the gesture can be detected by searching for a dip in the transition of the vector distance $V_2(c)$. However, the actual distance transition is not as smooth as that plotted in Figure 3D. The transition for actual input fluctuates, which makes it difficult to find the dip. To solve this problem, a moving average of the vector distance is employed. The moving average $\bar{V}_c(f)$ is computed by averaging the L most recent values of $V_c(f)$, where $V_c(f)$ is the $V_2(c)$ at video frame f, and L is the number of samples to be averaged.
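A possible way to turn the smoothed distance into a spotting decision is sketched below. The text specifies only that a moving average over L samples is used to find the dip; the local-minimum test, the `threshold` parameter, and the class name are assumptions made for illustration, and until L samples have arrived the sketch simply averages the samples seen so far.

```python
from collections import deque

class GestureSpotter:
    """Flags a dip in the moving average of the winner distance."""

    def __init__(self, window, threshold):
        self.recent = deque(maxlen=window)   # last L raw distances
        self.averages = deque(maxlen=3)      # last three moving averages
        self.threshold = threshold           # assumed upper bound for a dip

    def update(self, distance):
        """Feed one per-frame distance; return True when a dip has just passed."""
        self.recent.append(distance)
        self.averages.append(sum(self.recent) / len(self.recent))
        if len(self.averages) < 3:
            return False
        before, middle, after = self.averages
        # The middle value is a dip if it is a local minimum below the threshold.
        return middle < before and middle <= after and middle < self.threshold

# Hypothetical distance trace whose smoothed minimum lies at frame 5.
distances = [9, 8, 7, 6, 2, 1, 5, 8, 9]
spotter = GestureSpotter(window=2, threshold=4)
detected_at = [f for f, d in enumerate(distances) if spotter.update(d)]
```

Because a local minimum can only be confirmed once the following sample is known, this sketch reports the dip one frame late, here at frame 6.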

Simulation and Experiment
Performance of the proposed system was examined by computer simulation and experiment.

Simulation
The system was configured as follows.
The number of gesture classes: H = 9.
The data set for the simulation consisted of vector sequences taken from the frames of recorded gesture videos. The nine types of gesture shown in Figure 4 were used for the test. The gestures were defined with reference to the Cambridge Hand Gesture Data Set [29]. As Figure 4 shows, the data set consisted of nine classes. Each gesture class was defined with 10 frames, so the dimension of the sequence vector $\vec{X}$ was 20. Class labels 1 to 9 were assigned to the gesture types. Note that the labels are used only to distinguish the gesture classes and do not represent numerical characters. The gesture motions are combinations of three basic poses (Flat, Spread, V-shape) and three movements (Leftward, Rightward, Contract). Thus, the gesture classes form three groups, i.e., 1-2-3, 4-5-6, and 7-8-9. Note that the last posture of gesture class 1 is also the first posture of gesture class 2, and the last posture of gesture class 2 is the first posture of gesture class 3. The other gesture groups were designed in the same way so that the gesture classes in the same group could be performed seamlessly. The number of training vectors per class was 50, and 100 test vectors were used.
For comparison, the recognition performance of the JRNN [30] was examined using the same feature vectors $\vec{G}$. Table 1 summarizes the recognition accuracies of the two methods.

Real-Time Gesture Recognition
To conduct the experiment, real-time gesture recognition system was developed in software that ran on a PC. The input gesture was taken by the USB camera, and fed to the system. The recognition result was outputted only when the spotting function detected the end of the gesture. Figure 5 is a screen shot of the implemented real-time gesture recognition system.
The system was tested with the same gestures that were used in the simulation. The system was trained off-line by using a pre-captured gesture data set. The number of training data for the off-line training was 50 samples for each gesture class. Each training sample was acquired by simply capturing 10 consecutive frames. Neither the setting up of a key posture that represented the gesture nor the manual selection of key frames was done in the acquisition of the training data set. The recognition system then classified the dynamic gestures presented in real time using the weight vectors obtained from the off-line training. In the experiment, the recognition test was carried out 100 times for each gesture by the same person who had provided the training data set. Most of the gesture ends were correctly detected even though the gesture groups 1-2-3, 4-5-6, and 7-8-9 were performed in succession (a video of the experiment at 10 fps is available at http://www2.kansai-u.ac.jp/hikawa/ichikawa.mp4, accessed on 19 February 2021). Since a gesture in our method is defined by a fixed number of posture types, the input gesture must be made of the same number of posture types. Therefore, the speed of the gesture to be recognized depends on the speed of the gesture that was captured as the training data. The recognition tests were carried out at two speeds, i.e., 5 frames per second (fps) and 10 fps. Since each gesture consisted of 10 frames (a 20-dimensional sequence vector), the gesture speeds were 2 s per gesture at 5 fps and 1 s per gesture at 10 fps.
Table 2 shows the experimental results at a frame rate of 5 fps. The average recognition accuracy was 96.6%. NS in the table is the number of cases in which no spotting was detected, and MS is the number of cases in which multiple spotting occurred. Both cases were counted as errors. Table 3 summarizes the experimental results at a frame rate of 10 fps. The average recognition accuracy was 97.0%, and no significant difference due to the gesture speed was found.

Discussion
The simulation results show that the proposed method outperformed the JRNN. One difficulty of this gesture data set is that the class pairs 1-2, 4-5, and 7-8 are reverse gestures: the pairs include the same hand poses, and the poses appear in reverse order. For classes 1 and 2, the proposed system is inferior to the JRNN, but the proposed method recognized class 5 better. Another noticeable point is that the recognition accuracies of the JRNN for classes 6 and 9 are significantly worse than those of the proposed system. This is caused by another difficulty of the data set. As shown in Figure 4, the class pairs 4-6 and 7-9 have the same poses at their beginning, which confuses the classifiers.
The experimental results in Tables 2 and 3 show that the proposed system achieved high recognition and spotting performance, with average accuracies of 96.6% and 97.0%. Regarding the spotting, it could be implemented simply by counting frames because the number of frames per gesture is fixed. To do so, however, the start of the gesture must be detected correctly, and a possible method is the use of key poses to indicate the start of gestures, as in [15,21]. However, this approach disturbs the natural flow of gestures. Meanwhile, the proposed spotting finds the end of a gesture automatically when the sequence of frames matches one of the pre-trained ones; therefore, the user can start a gesture at any time without key poses.
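The idea of detecting the end of a gesture from the distance between the posture sequence vector and the winner neuron's weight vector can be sketched roughly as follows. This is only an illustration under stated assumptions, not the detection rule used in the system: the Euclidean distance, the local-minimum test, the `threshold` parameter, and the function name `spot_gesture_ends` are all hypothetical, and the mapping from winner neuron to gesture class is omitted.

```python
import numpy as np


def spot_gesture_ends(sequence_vectors, weights, threshold):
    """Return (frame_index, winner_neuron) pairs where a gesture end is spotted.

    sequence_vectors: array of shape (T, D), one posture sequence vector per frame.
    weights: array of shape (K, D), one pre-trained weight vector per neuron.
    threshold: assumed upper bound on the winner distance for accepting a match.
    """
    x = np.asarray(sequence_vectors, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Distance from each frame's sequence vector to every neuron (Euclidean, assumed).
    dists = np.linalg.norm(x[:, None, :] - w[None, :, :], axis=2)
    winners = dists.argmin(axis=1)      # winner neuron at each frame
    winner_dist = dists.min(axis=1)     # distance to that winner

    ends = []
    # A gesture end is assumed where the winner distance reaches a local
    # minimum that is also below the threshold (needs one frame of lookahead).
    for t in range(1, len(winner_dist) - 1):
        is_local_min = (winner_dist[t] < winner_dist[t - 1]
                        and winner_dist[t] <= winner_dist[t + 1])
        if is_local_min and winner_dist[t] < threshold:
            ends.append((t, int(winners[t])))
    return ends


# Toy data: two neurons and a stream that approaches each of them once.
toy_weights = [[0.0, 0.0], [10.0, 10.0]]
toy_stream = [[5, 5], [1, 1], [0, 0.1], [2, 2], [6, 6],
              [9, 9], [10, 10.2], [8, 8], [5, 5]]
print(spot_gesture_ends(toy_stream, toy_weights, threshold=0.5))
```

In this sketch no start marker is needed: the stream is compared with the trained weight vectors at every frame, which is why a gesture can begin at any time. A threshold that is too loose would tend to produce multiple spotting, and one that is too strict would tend to produce missed spotting.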
The tables also indicate that most of the recognition errors were caused by spotting errors. In the NS cases, no spotting was detected, and no recognition result was available. During the experiment, we observed that the proposed spotting detected the end of the gesture twice in all MS cases, and two recognition results were output. In most of these cases, the recognition result at the first spotting was incorrect and the second one was correct. Therefore, improving the detection accuracy of the spotting function would improve the recognition results.
Table 4 compares the recognition accuracy of the proposed system with the state of the art in the literature. Since the experimental conditions are not the same, the accuracies in the table should not be directly compared. Six of the systems are real-time recognition systems, and the others were tested with various gesture data sets. Four of them are vision-based systems, and the others used 3D gesture images taken from special sensors. A vision-based system is more challenging than 3D gesture recognition since it uses limited 2D information, but it can be realized with simple, readily available cameras. Additional burdens of a real-time recognition system are high-speed computation and gesture spotting. Note that the proposed system provides a natural spotting function with no special key posture that indicates the start or end of a gesture. Considering that it combines vision-based input, accurate real-time recognition, and a spotting function, the proposed system makes a substantial contribution to dynamic gesture recognition applications. However, even though the proposed system does not require special sensors, it still requires users to wear a color glove, which may prevent it from being used in everyday life. To solve this problem, skin color detection [5,6] is one option for hand segmentation without the colored glove.

Conclusions
This paper proposed a vision-based real-time dynamic hand gesture recognition system with a gesture spotting function. To recognize dynamic gestures, the SOM-SOM-Hebb classifier was newly devised. To provide the spotting function, the end of a gesture was detected from transitions of the vector distance between the input vector and the winner neuron's weight vector. This gesture spotting capability made the system much more practical.
The proposed recognition algorithm was examined by simulation and real-time experiment. The results revealed that the system could recognize the nine types of gesture with an accuracy of 96.6%, which was better than that of other recognition systems. Other advantages of our system over the compared methods are its real-time operability and the gesture spotting function.
A major drawback of the proposed system is the use of the color glove, and implementation of hand segmentation without the glove is left for our future work. Another future research objective is to develop a hardware gesture recognition system with faster recognition speed, higher portability, and lower power consumption than those of the PC implementation.