sEMG-Based Gesture Recognition with Convolution Neural Networks

Traditional classification methods for limb motion recognition based on sEMG have been studied extensively and have shown promising results. However, information loss during feature extraction reduces the recognition accuracy. To obtain higher accuracy, deep learning methods have been introduced. In this paper, we propose a parallel multiple-scale convolution architecture. Compared with state-of-the-art methods, the proposed architecture fully considers the characteristics of the sEMG signal. It adopts larger kernel filters than those commonly used in other CNN-based hand gesture recognition methods. Meanwhile, a characteristic of the sEMG signal, namely muscle independence, is considered when designing the architecture. All the classification methods were evaluated on the NinaPro database. The results show that the proposed architecture has the highest recognition accuracy. Furthermore, the results indicate that a parallel multiple-scale convolution architecture that uses larger kernel filters and considers muscle independence can significantly increase the classification accuracy.


Introduction
Surface electromyographic (sEMG) signals, which are generated by the electrical activity of the muscle fibers, can be detected noninvasively by surface electrodes. These signals reflect muscle activity and provide limb movement information. Under the assumption that the patterns of sEMG signals are repeatable for the same movements and distinguishable for different movements [1], the recognition of limb motions based on sEMG signals has been widely used in many man-machine interfaces [2,3], such as upper-limb prostheses [4]. However, there are some gaps between application and research [5]. In practical applications, conditions such as low power consumption [6], portability, space constraints [7] and extensive sEMG data with multiple channels and a high sample rate [8] must be considered. Besides those, sEMG-based classification techniques have been extensively researched [9].
The quality of the sEMG signal and the processing method are the main factors affecting the classification accuracy. Correct electrode locations, an appropriate choice of channels, and a proper selection of hand gestures improve the signal quality and lead to high classification accuracy [10][11][12]. When processing signals, the raw sEMG signals are rarely used directly to recognize limb motions, as they can easily be disturbed by environmental noise, electrode location shifts and loose electrode-skin contacts, causing inaccuracy in the recognition of limb movements. To mitigate this issue and improve the accuracy, traditional methods usually consist of four phases: preprocessing, windowing, feature extraction and classification [13]. Feature extraction converts the sEMG signals to a compact and informative set of features. Those features are usually hand-crafted by human experts, and the extraction methods can be categorized as operating in the time domain [13][14][15][16], the frequency domain [17,18] and the time-frequency domain (TFD) [19]. For example, the features in the time domain usually consist of Mean Absolute Value (MAV), Root Mean Square (RMS), Mean Absolute Value Slope (MAVSlope), Waveform Length (WL), Slope Sign Changes (SSC), Zero Crossings (ZC) and EMG Histogram (HIST), which is an extension of the Zero Crossings. The characteristic frequency domain features include Median Frequency (MDF), the 3rd Spectral Moment (SM3) and Median Amplitude Spectrum (MDA). All these features are designed by human experts, and some have a strong correlation with muscle function. For example, the RMS is related to constant-force and non-fatiguing contraction, and the ZC reflects muscle fatigue. As for the classification phase, machine learning algorithms assign the extracted features to the class (gesture) to which they most probably belong.
Sustainability 2018, 10, 1865; doi:10.3390/su10061865; www.mdpi.com/journal/sustainability
In the past decade, the optimal methods of classifying EMG signal patterns have been extensively researched [20,21]. Different classifiers have been introduced, such as k-Nearest Neighbors (KNN) [22], neural networks [14,23], Bayesian classifiers [17,24], linear discriminant analysis (LDA) [25], support vector machines (SVM) [26,27] and Random Forests (RF) [28][29][30][31]. Besides, the combination of multiple classifiers is also a desirable method to improve classification accuracy. Ahmed et al. proposed a new dynamic channel selection method which combines multiple classifiers (LDA, SVM, quadratic discriminant analysis, Bayes classifier and extreme learning machine) in the algorithm [32]. Both phases (feature extraction and classification) affect the classification accuracy, the feature set in particular. Hence, to get a higher classification accuracy, some researchers focused on methods to obtain an appropriate feature set, such as the principal component analysis (PCA) of TFD features, the nonnegative matrix factorization (NMF) algorithm and the Nonlinear Multiscale Maximal Lyapunov Exponent [9,19,33]. Despite the promising performance that has been shown, the greatest disadvantage of those traditional methods is that some useful information may be discarded when extracting features.
Inspired by the recent success of deep learning, which has been widely used in speech recognition and computer vision [34], Atzori et al. introduced a new method based on a Convolutional Neural Network (CNN) to decode sEMG signals [35]. Along the time sequence, the sEMG signals from different electrodes were regarded as sEMG images. Unlike traditional methods, a CNN can extract features without any additional information or manually designed feature extractor. Four convolutional layers with three different kernel sizes (3 × 3, 5 × 5 and 9 × 1) and two pooling layers were adopted. The result of [35] indicates that classical machine learning classification methods are slightly inferior to a convolutional neural network with a simple architecture. The architecture of the CNN has a significant influence on the classification accuracy. Geng et al. [36] and Du et al. [37] used the same ConvNet architecture, which consists of four convolutional layers and two fully connected layers, to recognize hand gestures from the instantaneous sEMG image. As for the kernel sizes, each of the first two convolutional layers consists of 64 filters of 3 × 3, while each of the last two convolutional layers consists of 64 non-overlapping filters of 1 × 1. The result shows a significant improvement in accuracy over classical classifiers. With an accuracy of 76.1% on a single frame of sEMG signals and 77.8% using simple majority voting over a 200 ms window on DB2 of Ninapro [36], the architecture shows better performance than Atzori's method. Ulysse et al. [38] also adopted small convolution kernel sizes (3 × 3 and 4 × 3) to process myoelectric information. However, they calculated the spectrograms of the raw sEMG data and delivered the spectrograms to the CNN. Xiaolong et al. [39] proposed an improved method based on the spectrogram of sEMG. After the calculation of the spectrogram, principal component analysis (PCA) is performed to reduce the dimensionality. The CNN model used in [38] only contains one convolutional layer with a 5 × 5 kernel size. After a series of processing procedures, Xiaolong's method achieved 78.71% classification accuracy. All those previous results show that the CNN is an effective method for electromyographic signal pattern recognition. However, in the current methods, the size of the kernel filter is usually the same as the size commonly used in computer vision, which might not be suitable for sEMG signals.
The choice of the kernel filter size should consider the characteristics of the EMG signal itself. Meanwhile, considering the non-stationary and noisy nature of myoelectric signals, the existing architectures may not be complex enough to obtain an appropriate sEMG feature set.
In this study, to better adapt to the characteristics of sEMG signals and achieve higher classification accuracy, we propose a parallel multiple-scale convolution architecture that can extract features without any additional information or manually designed feature extractors. The characteristics of sEMG signals are considered in the design of the CNN architecture. Unlike the kernel filters commonly used in computer vision, our architecture uses larger kernel filters. In addition, the proposed architecture neither fuses the information of different sEMG channels at the beginning, nor analyzes each channel separately and fuses the channels only at the result level. Instead, considering muscle independence, each sEMG channel is processed independently at the front end, which eliminates the error that may be caused by premature fusion, and the information of all channels is then fused and analyzed jointly. To evaluate the proposed parallel architecture, reference experiments, including a classical method, were implemented.

Database
The data used in our work are the second database (DB2) from the Ninapro project, which is a publicly accessible database that has previously been used for hand gesture recognition. In DB2, 40 intact subjects (28 males, 12 females; 34 right-handed, 6 left-handed; age 29.9 ± 3.9 years) were instructed to perform 50 types of hand, wrist, functional and grasping movements, organized in three distinct sets of exercises (referred to as Exercises B, C and D in [40]). Each movement was repeated six times with a 3 s rest posture in between.
Twelve Trigno wireless electrodes were used to record the sEMG signals. Eight electrodes were located around the forearm at the height of the radiohumeral joint. Two electrodes were placed on the flexor and extensor digitorum superficialis. Two electrodes were placed on the biceps and triceps. The raw sEMG signals were sampled at a rate of 2 kHz with a baseline noise of less than 750 nV RMS. Before the raw data could be used, the signals were processed in several steps, such as filtering with a Hampel filter (cleaning the signals from the 50 Hz power-line interference), synchronization and relabeling. Details can be found in [40].
In this study, 17 hand and wrist movements of Exercise B (eight isometric and isotonic hand configurations and nine basic movements of the wrist) were considered. Approximately 2/3 of the movement repetitions (Repetitions 1, 3, 4 and 6) were used as the training set, and the other two movement repetitions (Repetitions 2 and 5) were used as the testing set.
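The repetition-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout `(movement, repetition, recording)` and the function name `split_by_repetition` are assumptions made for the example, and the recordings are placeholders.

```python
# Repetition numbers taken from the text: 1, 3, 4, 6 for training; 2, 5 for testing.
TRAIN_REPETITIONS = {1, 3, 4, 6}
TEST_REPETITIONS = {2, 5}


def split_by_repetition(trials):
    """Split (movement, repetition, recording) tuples into train and test lists."""
    train = [t for t in trials if t[1] in TRAIN_REPETITIONS]
    test = [t for t in trials if t[1] in TEST_REPETITIONS]
    return train, test


# Placeholder trials for one subject: 17 movements x 6 repetitions (hypothetical layout).
trials = [(movement, repetition, None)
          for movement in range(1, 18)
          for repetition in range(1, 7)]
train, test = split_by_repetition(trials)
print(len(train), len(test))  # 68 34
```

Splitting by whole repetitions keeps overlapping windows from the same repetition out of both sets at once.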

Data Analysis and Processing
The classification procedure is similar to that of Englehart et al. [13] and consists of windowing, feature extraction, and classification. No preprocessing procedure such as low-pass filtering [36], fast Fourier transform (FFT) [39] or standardization [4,41] was implemented in our algorithm. On the one hand, preprocessing such as FFT and low-pass filtering may cause the loss of useful information. On the other hand, preprocessing such as low-pass filtering introduces time latency, which is not conducive to real-time control.

Windowing
Before feeding the sEMG signals to the classification algorithm, the data should be processed to match the input dimension of the algorithm. For each channel, the sEMG signals were segmented using a sliding window with a length of L milliseconds (2L samples). The increment of the sliding window was set to 10 ms (20 samples). Figure 1 presents the segmentation and combination of sEMG signals. The sEMG signals were converted to several 12 × 2L sEMG images for each subject, where 12 represents the number of electrodes.
The length of the window represents a compromise between time latency and classification accuracy. As described in [14], to satisfy the requirement of real-time control, the time latency should be less than 300 ms. Longer windows lead to higher controller delays as well as increased classification accuracy [42][43][44]. In previous works [13,40,45], L is greater than 200 ms to get higher classification accuracy. To test the performance of the proposed algorithm in this study, L equal to 100 ms was chosen. Ultimately, the sEMG signals from 12 electrodes were converted into sEMG images of size 12 × 200.
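The windowing step can be illustrated with a short NumPy sketch. It assumes a recording stored as a `(channels, samples)` array sampled at 2 kHz; the function name `segment` and the synthetic input are hypothetical and only serve to show the resulting 12 × 200 image shape.

```python
import numpy as np


def segment(emg, fs=2000, window_ms=100, step_ms=10):
    """Cut a (channels, samples) recording into (n_windows, channels, window) images."""
    win = fs * window_ms // 1000   # 200 samples at 2 kHz
    step = fs * step_ms // 1000    # 20 samples at 2 kHz
    starts = range(0, emg.shape[1] - win + 1, step)
    return np.stack([emg[:, s:s + win] for s in starts])


# One second of synthetic 12-channel data standing in for a real recording.
emg = np.random.default_rng(0).standard_normal((12, 2000))
images = segment(emg)
print(images.shape)  # (91, 12, 200)
```

With a 20-sample increment, consecutive images share 180 of their 200 columns, so a recording shorter than one window yields no images and would need separate handling.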

Feature Extraction and Classification
We employed a deep convolutional network to classify the hand gestures without any additional information or manually designed feature extractors. Figure 2 shows the architecture of the proposed deep convolutional network, named Convolution with two Parallel Block (C-B1PB2), which consists of two parts marked by the red dotted line: a feature extractor and a classifier.
The feature extractor is used to select an appropriate feature representation for sEMG and to reduce the input dimension of the classifier. It is composed of two blocks marked by the black dotted line.
In Block 1, five convolution layers and two maximum pooling layers are employed. The first three convolution layers contain 40 2D filters of 1 × 13 with a stride of 1 and a zero padding of 0. The last two convolution layers are similar to the preceding layers except for the first dimension of the kernel filter. In these two layers, the information from different electrodes is mixed to detect the relevance of each electrode. The two maximum pooling layers, which use filters of 1 × 2, follow the first and second convolution layers, respectively. The pooling layers are intended to improve the robustness of the algorithm, so that local disturbances of the sEMG signal caused by noise do not affect the classification results.
Block 2 differs from Block 1 in the first three convolution layers, which adopt a bigger kernel filter size. The first three convolution layers contain 40 2D filters of 1 × 57 with a stride of 1 and a zero padding of 0. The following two convolution layers are the same as the last two convolution layers of Block 1. Pooling layers are not used in Block 2.
The two blocks are parallel and do not influence one another when extracting features. The outputs of the two blocks are concatenated and then delivered to the classifier.
The classifier is composed of three fully connected layers and a softmax layer. The input layer consists of 520 units, which correspond to the features extracted by the two blocks. The first and the second hidden layers consist of 260 and 130 units, respectively. The output layer has 17 units, equal to the number of hand gestures.
In both blocks, batch normalization is employed between each convolution layer and its activation function. In the classifier, dropout with a probability of 0.5 is applied after the first and second fully connected layers.
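The architecture described in this section can be sketched in PyTorch as follows. This is an illustrative reading of the text, not the authors' implementation. Several details are assumptions: the class names `Block` and `CB1PB2` are hypothetical, the activation function is taken to be ReLU because the text does not name it, and the last two convolution layers are assumed to use 7 × 13 and 6 × 13 kernels (the kernel heights 7 and 6 are borrowed from the C-SK2 description). Under these assumptions the flattened outputs are 200 and 320 features, which matches the 520 input units stated for the classifier.

```python
import torch
from torch import nn


def conv_bn_act(in_ch, out_ch, kernel):
    # Convolution (stride 1, no padding) -> batch normalization -> activation.
    # ReLU is an assumption; the text does not name the activation function.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=1, padding=0),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )


class Block(nn.Module):
    """Three per-channel temporal convolutions, then two electrode-mixing ones."""

    def __init__(self, width, pooled):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(3):
            layers.append(conv_bn_act(in_ch, 40, (1, width)))
            if pooled and i < 2:
                layers.append(nn.MaxPool2d((1, 2)))  # after conv 1 and conv 2
            in_ch = 40
        # Assumed kernel heights 7 and 6 collapse the 12 electrode rows to 1.
        layers.append(conv_bn_act(40, 40, (7, 13)))
        layers.append(conv_bn_act(40, 40, (6, 13)))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.flatten(self.net(x), start_dim=1)


class CB1PB2(nn.Module):
    def __init__(self, n_classes=17):
        super().__init__()
        self.block1 = Block(width=13, pooled=True)    # 200 features
        self.block2 = Block(width=57, pooled=False)   # 320 features
        self.classifier = nn.Sequential(
            nn.Linear(520, 260), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(260, 130), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(130, n_classes),
        )

    def forward(self, x):
        features = torch.cat([self.block1(x), self.block2(x)], dim=1)
        return self.classifier(features)  # logits; softmax is applied afterwards


model = CB1PB2()
model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    logits = model(torch.zeros(4, 1, 12, 200))  # four 12 x 200 sEMG images
print(logits.shape)  # torch.Size([4, 17])
```

The sketch returns logits rather than probabilities; the softmax layer can be applied with `torch.softmax(logits, dim=1)` at inference time, or folded into `nn.CrossEntropyLoss` during training.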

Experiments and Results
As described above, the proposed method has some distinguishing features, such as the parallel blocks and the size of the convolution kernel. Several reference experiments were conducted to evaluate the performance of C-B1PB2 with those distinctive features.

Classical Classification (CC):
For each channel, all data were standardized to have zero mean and unit standard deviation [39]. The length of the sliding window was 100 ms (200 samples). The increment of the sliding window was set to 10 ms (20 samples). The selected signal features include Mean Absolute Value (MAV), Waveform Length (WL), Zero Crossings (ZC), Histogram (HIST) and marginal Discrete Wavelet Transform (mDWT) [14,43,46]. The HIST requires the number of bins to be predefined. The mDWT decomposes the signals in terms of a basis function (i.e., the wavelet) at different levels of resolution, resulting in a high-dimensional frequency-time representation [46]. The predefined number of bins and the parameters of the wavelet are listed in Table 1. A random forest (RF) classifier was used to recognize the hand gestures.
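Three of the time-domain features named above have simple closed forms, sketched here with NumPy using their standard definitions. The sketch is written in Python for illustration only. The function names, the `threshold` parameter of `zc` and the feature ordering are assumptions, and HIST and mDWT are omitted because their parameters come from Table 1.

```python
import numpy as np


def standardize(x):
    """Scale each channel of a (channels, samples) array to zero mean, unit std."""
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)


def mav(x):
    """Mean Absolute Value per channel."""
    return np.mean(np.abs(x), axis=-1)


def wl(x):
    """Waveform Length: summed absolute difference of consecutive samples."""
    return np.sum(np.abs(np.diff(x, axis=-1)), axis=-1)


def zc(x, threshold=0.0):
    """Zero Crossings: sign changes whose amplitude step exceeds a threshold."""
    sign_change = x[..., :-1] * x[..., 1:] < 0
    large_step = np.abs(x[..., :-1] - x[..., 1:]) > threshold
    return np.sum(sign_change & large_step, axis=-1)


def feature_vector(window):
    """Concatenate the per-channel features of one (channels, samples) window."""
    return np.concatenate([mav(window), wl(window), zc(window)])


window = np.array([[1.0, -2.0, 3.0, -1.0]])  # one channel, four samples
print(mav(window), wl(window), zc(window))  # [1.75] [12.] [3]
```

For a 12 × 200 window, `feature_vector` returns 36 values (three features for each of the 12 channels), which would then be passed to the classifier.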

Convolution with two parallel Block 1 (C-2B1):
As shown in Figure 3a, C-2B1 is composed of two parallel Block 1 structures, as described in Figure 2. The input layer of the classifier has 400 units to match the features extracted by the two blocks, while the remaining layers of the classifier are the same as in Figure 2.

Convolution with two parallel Block 2 (C-2B2):
Figure 3b shows the architecture of C-2B2, which consists of two parallel Block 2 structures. The input layer of the classifier is replaced by 640 units, and the remaining layers are unchanged.

Convolution with a different kernel (C-DK):
As represented in Figure 3c, the structure is the same as in Figure 2 except for the first dimension of the filter in each convolution layer. For the upper block in the feature extractor, the first four convolution layers have 40 2D filters of 3 × 13, while the last convolution layer has 40 2D filters of 4 × 13, with a stride of 1 and a zero padding of 0. The maximum pooling layers are the same as in Block 1 and follow the first and second convolution layers. For the lower block in the feature extractor, the first three convolution layers have 40 2D filters of 3 × 57, while the last two convolution layers are identical to the last two convolution layers of the upper block. The remaining parameters are the same as in C-B1PB2.

Convolution with the small kernel (C-SK):
As shown in Figure 3d, the architecture of C-SK contains two identical parallel CNN blocks. The size of the kernel filter is smaller than in the previous comparison experiments but similar to that of the state-of-the-art methods [35][36][37][38][39]. The first four convolution layers contain 40 2D filters of 3 × 3 with a stride of 1 and a zero padding of 0. The last convolution layer is similar to the preceding layers except for the first dimension of the kernel filter: it consists of 40 2D filters of 4 × 3. The two maximum pooling layers, which use filters of 1 × 2, follow the first and second convolution layers, respectively.

Convolution with the small kernel 2 (C-SK2):
As represented in Figure 3e, the architecture of C-SK2 is similar to the architecture of C-SK except for the first dimension of the kernel filter in each convolution layer. Compared with the C-B1PB2 method, the main difference is the second dimension of the kernel filter, which corresponds to the sampling points. The C-SK2 architecture adopts a smaller kernel filter. The first three convolution layers are composed of kernel filters of 1 × 3, while the last two layers use kernel filters of 7 × 3 and 6 × 3, respectively.
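The classifier input sizes quoted for C-B1PB2, C-2B1 and C-2B2 (520, 400 and 640 units) can be cross-checked with plain output-size arithmetic. The sketch below assumes a 12 × 200 input, non-overlapping 1 × 2 pooling, and kernel heights of 7 and 6 with a width of 13 in the last two convolution layers of Block 1 and Block 2; those heights are not stated for these blocks and are borrowed from the C-SK2 description. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output length of non-overlapping max pooling (stride equals kernel)."""
    return size // kernel


def block_features(width, pooled, rows=12, cols=200, filters=40):
    """Flattened feature count of one block for a rows x cols sEMG image."""
    for i in range(3):                 # per-channel temporal convolutions
        cols = conv_out(cols, width)
        if pooled and i < 2:           # pooling after the first two layers
            cols = pool_out(cols, 2)
    for height in (7, 6):              # assumed electrode-mixing kernels
        rows = conv_out(rows, height)
        cols = conv_out(cols, 13)
    return filters * rows * cols


b1 = block_features(13, pooled=True)    # Block 1
b2 = block_features(57, pooled=False)   # Block 2
print(b1, b2, b1 + b2, 2 * b1, 2 * b2)  # 200 320 520 400 640
```

That the three stated sizes all follow from the same two block outputs suggests the assumed kernel heights are at least consistent with the text.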
By comparing the results of C-SK2, C-2B1, C-2B2, and C-B1PB2, the influence of the kernel filter size on classification accuracy can be determined. Comparing C-DK with C-B1PB2, or C-SK with C-SK2, reveals the effect of considering the sEMG signal characteristics on classification accuracy. Moreover, we evaluated C-B1PB2 on all hand gestures of NinaPro DB2 (including Exercises B, C and D) to verify the effectiveness of the proposed classification algorithm.
The Classical Classification method, which consists of preprocessing, windowing, feature extraction and classification, was implemented in MATLAB. The other experiments were implemented with PyTorch.
Table 2 gives the average classification accuracy for each experiment. The first five rows show the average classification accuracy of each method on NinaPro DB2 Exercise B, while the last row shows the result on all hand movements of NinaPro DB2 (Exercises B, C, and D). Across all experiments, the proposed C-B1PB2 obtains the best performance on NinaPro DB2 Exercise B, while C-SK has the lowest classification accuracy. Except for C-SK, which uses small kernel filters and combines the information from different channels in every convolution layer, the CNN-based methods achieve higher accuracy than the classical method. C-2B2, with a larger kernel filter size, obtained a higher classification accuracy than the C-2B1 and C-SK2 methods. Meanwhile, as the size of the kernel filter increases (C-SK2, C-2B1, and C-2B2), the classification accuracy also increases. Among the CNN-based methods, C-SK and C-DK, both of which ignore muscle independence, achieve lower classification accuracy.
Figure 4 shows the average confusion matrix, which details the classification and misclassification of the hand gestures for the proposed C-B1PB2. Movements 1, 2, 6, 12, 14 and 15, which correspond to Thumb up, Extension of the index and middle fingers, Fingers flexed together in fist, Wrist pronation, Wrist extension and Wrist radial deviation, respectively, are classified more accurately than the remaining movements. Movements 1-8 are finger movements, while the remaining movements are wrist movements.
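As a reminder of how a confusion matrix such as the one in Figure 4 is read, the following sketch builds one from true and predicted labels and derives the per-gesture accuracy from its rows. The labels are made up for illustration and are not results from this study; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Toy labels for three gestures (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
print(matrix)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_accuracy(matrix))  # [0.5, 1.0, 0.5]
```

An average confusion matrix over subjects would be obtained by row-normalizing each subject's matrix and averaging the results; whether the authors averaged in this way is not stated in the text.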
The C-B1PB2 method obtains an average increase in accuracy of 5.83% compared with the CC method. This result shows the effectiveness of CNNs in sEMG-based hand gesture recognition. In addition to the disparities caused by the framework, the most significant difference between C-B1PB2 and CC is the input data. Before windowing, the input of CC was filtered to remove interference and standardized to have zero mean and unit standard deviation, while the input of C-B1PB2 was not preprocessed. Preprocessing of sEMG signals will influence the performance in real-time control of upper-limb prostheses.
[...] 60.27% based on the CNN method and 75.27% based on the classical classification method (Random Forests with all features), with low-pass Butterworth filter (1st order, 1 Hz) preprocessing and a 200 ms window. Xiaolong et al. [39] achieved a recognition accuracy of 78.71% based on the CNN method, with preprocessing (normalization and FFT) and a 200 ms window. Even without a long window and preprocessing, the result of our method is comparable to state-of-the-art methods on DB2. These results further confirm the effectiveness of our architecture.
The selection of hand gestures affects the classification accuracy [11,12,33]. The confusion matrix shows that the more similar a hand gesture is to the rest posture, the lower its classification rate. As for the wrist movements, the accuracy of the extension and supination movements is higher than that of the corresponding flexion and pronation movements. This may be caused by the inherent joints: the extension and supination movements may produce a higher level of muscle activation.

Figure 1. Converting the sEMG signals to sEMG images by sliding window. P(a,b) represents a segment of the sEMG signal from electrode b at time a. P_a represents the sEMG signals from the 12 electrodes at time a.

Figure 2. Schematic of C-B1PB2 used on the sEMG signals.

Table 1. The parameters of HIST and mDWT.

Table 2. The results of classification methods.

Table 3. Performance comparison with state-of-the-art methods on all hand gestures of DB2.