Multiday EMG-Based Classification of Hand Motions with Deep Learning Techniques

Pattern recognition of electromyography (EMG) signals can potentially improve the performance of myoelectric control for upper limb prostheses with respect to current clinical approaches based on direct control. However, the choice of features for classification is challenging and impacts long-term performance. Here, we propose the use of raw EMG signals recorded over multiple days as direct inputs to deep networks with intrinsic feature extraction capabilities. Seven able-bodied subjects performed six active motions (plus rest), and EMG signals were recorded for 15 consecutive days with two sessions per day using the MYO armband (MYB, a wearable EMG sensor). The classification was performed by a convolutional neural network (CNN) with raw bipolar EMG samples as the inputs, and the performance was compared with linear discriminant analysis (LDA) and stacked sparse autoencoders with features (SSAE-f) and raw samples (SSAE-r) as inputs. CNN achieved a lower classification error than both LDA and SSAE-r in the within-session, between-sessions (same day), between-pairs-of-days, and leave-one-day-out analyses (p < 0.001). However, no significant difference was found between CNN and SSAE-f. These results demonstrated that CNN significantly improved performance and increased robustness over time compared with standard LDA with associated handcrafted features. This data-driven feature extraction approach may overcome the problem of feature calibration and selection in myoelectric control.


Introduction
Myoelectric control for upper limb prostheses is based on the electrical activity (electromyogram, EMG) generated in remnant muscles, a technology that dates back to 1948 [1]. This approach is commonly used in clinical prosthetic systems [2]. Commercially available upper limb electric prostheses use conventional myoelectric control schemes, such as on/off, proportional, and direct activation [3] and are often limited to one degree of freedom (DoF) [4]. Pattern recognition (PR)-based control schemes have emerged as an alternative to conventional myoelectric control schemes for activation of multiple DoFs [5]. These systems have been widely explored to improve the multifunctionality of dexterous prosthetic hands [6][7][8][9], yet their clinical usability is still limited.
The two important steps in PR-based schemes are feature extraction and classification. The choice of optimal features for classification is a challenging task [10][11][12][13][14]. Hudgins et al. [11] proposed four time-domain features that have subsequently been used extensively and are now considered a benchmark. EMG signals are stochastic in nature, and their statistical properties may change over time, even within the same recording session. For example, changes in arm posture and fatigue influence the EMG, among several other factors [6]. These factors influence the features and decrease the performance of the systems over time [15]. Phinyomark et al. [12,16,17] compared several alternative time- and frequency-domain features individually and in combination. However, these features have so far not shown substantial advantages over the simpler time-domain features, specifically not in terms of the robustness of the system. Furthermore, the optimization of a single handcrafted EMG feature or a combination of such features [18] has not proven effective in providing adequate and robust controllability. Therefore, data-driven automatic feature selection has been proposed to improve robustness [19][20][21].
Following feature selection, classification has been performed with several approaches, including linear discriminant analysis (LDA) [22], support vector machine (SVM) [23], artificial neural networks (ANN) [24], hidden Markov model [25], decision tree [26], Bayesian network [27], k-nearest neighbor (KNN), and random forests (RF) [28]. However, the selection of features remains challenging. In the last decade, deep learning algorithms have shown promising results in several fields, including computer vision [29], natural language processing [30], speech recognition [31], and bioinformatics [32]. These algorithms combine many non-linear ANN layers and are capable of deriving data-dependent features from the raw data. Convolutional neural networks have been applied for myoelectric control with a focus on intersession/intersubject and intrasession performance, in addition to many other applications in biomedical signal processing [33][34][35]. For intersession/intersubject evaluation, some authors have used an adaptation strategy based on previous sessions, while others have used separate sessions as training and testing data. These studies have been performed on publicly available databases, including Ninapro [21,36], Capgmyo, and csl-hdemg [37].

Deep Learning for Myoelectric Control
In the pioneering work on CNN-based myoelectric control schemes, Park and Lee [38] developed a user-adaptive multilayer CNN algorithm to classify surface EMG (sEMG) patterns using data from the Ninapro database and found that CNN outperformed SVM in both the non-adaptation and adaptation schemes by about 12-18 percentage points. Geng et al. [37] proposed a deep convolutional network for high-density sEMG images. They evaluated their proposed algorithm on all three databases (mentioned above) and found that deep networks outperformed the classical classifiers, including KNN, SVM, LDA, and RF. However, Atzori et al. [21] showed that CNN performance was comparable to that of other classical classifiers, including KNN, SVM, and LDA, when using a modified version of the well-known CNN architecture called LeNet [39]. Du et al. [40] presented a benchmark high-density sEMG (HDsEMG) database and developed a multilayer CNN based on a deep domain adaptation framework. They performed both intra- and intersession/intersubject analyses and found that the deep domain adaptation-based architecture outperformed all the other classical classifiers. Zhai et al. [15] proposed a CNN-based self-recalibrating classifier intended to update over time, but the dataset was recorded within a single day. They found that it significantly outperformed the SVM classifier. Du et al. [41] proposed a semi-supervised learning algorithm based on CNN for unlabeled data and used a data glove to learn additional information about hand postures and the temporal order of sEMG frames. They found that classification accuracy improved significantly as compared with random forests, AtzoriNet [21], and GengNet [37]. Wei et al. [42] proposed a multi-stream CNN with decomposition and fusion stages that could learn the correlation between individual muscles. They used a divide-and-conquer strategy and evaluated their model on three benchmark databases. The results showed that the multi-stream CNN outperformed the simple CNN and random forests classifiers.
Xia et al. [20] proposed for the first time a hybrid CNN-RNN (recurrent neural network) architecture to address the variation in signals over time. The input to this network consisted of time-frequency frames from sEMG signals, and the network was evaluated with the data of eight subjects recorded in six sessions, though on the same day. The results showed that the hybrid CNN-RNN architecture outperformed CNN and support vector regression (SVR). Allard et al. [19] performed a real-time study with transfer learning based on CNN. They collected two different datasets of 18 and 17 subjects, respectively, using an eight-channel MYO armband (a wearable EMG sensor) and controlled a six-DoF robotic arm. Their proposed CNN achieved 97.8% accuracy for seven hand movements, slightly better than the baseline CNN (96.2%). Using classification accuracy as a performance metric, these studies have shown that deep learning techniques are promising for myoelectric control schemes. However, they were performed on datasets that were collected in single or multiple sessions (short-term) within the same day. Although the idea of a self-adaptive algorithm is noble and encouraging, robustness over time, with day-to-day variations and potential adaptation from the user, has remained unexplored.

Contribution
The use of machine learning (ML), rather than direct control, has attempted to advance the control possibilities of the users [6], but it is limited by unsatisfactory robustness to non-stationarities (e.g., changes in electrode positions and the skin-electrode interface) [43]. Robustness is a key characteristic of any clinical solution. Very advanced control systems, including recently proposed deep networks that allow a substantial functional benefit in short-term laboratory tests, cannot be translated into clinical solutions if their performance worsens over time. The long-term performance of conventional ML algorithms relies heavily on data representations called features. Thus, the problem of robustness is associated with reliable performance over time and with the choice of features to describe the signal for subsequent classification. The above literature review revealed that the proposed deep learning algorithms make use of short-term recordings with prior data transformation, reducing the EMG signal to a handcrafted feature and making the problem similar to a conventional ML approach. Furthermore, in such short-term conditions, the need for deep learning with respect to more conventional ML methods is very limited, because conventional ML has been shown to be very effective (classification accuracy easily >95% for >10 classes [44]) in short laboratory recordings. Recently, it has been shown that classification accuracies vary significantly over time [45][46][47], as data recorded on one day have different characteristics from data recorded on another day due to real-world conditions. The key challenge is not the short-term laboratory condition but daily use.
Thus, we propose a longitudinal approach to myoelectric control that makes use of a convolutional neural network architecture with raw EMG signals as inputs, in order to explore the real potential of deep learning in utilizing the intrinsic features (deep features) of the EMG signals, specifically to enhance the long-term robustness of the classification task.
Adaptive learning strategies for classifiers are promising [15,48], but this work focused on the effect of long-term bipolar EMG data recording on performance. Algorithm adaptation was not explored, so that any observed changes in performance could be attributed to the longitudinal data rather than to an adaptive algorithm. The EMG signals were recorded in two experimental sessions per day over 15 consecutive days. We evaluated the performance of stacked sparse autoencoders (SSAE), an unsupervised deep learning technique, with both handcrafted features (SSAE-f) and raw EMG samples (SSAE-r) extracted from varying lengths (1-15 days) of recorded EMG. Intrasession, intersession, and inter-day analyses were performed, and the performance of both the CNN and SSAE was compared with that of the state-of-the-art LDA.

Subjects
Seven able-bodied subjects (four males and three females, age range of 24-30 years, and mean age 27.5 years) participated in the experiments. They had no known prior history of musculoskeletal or upper extremity disorders, and their right hand was used in the experiments. The procedures were in accordance with the Declaration of Helsinki and approved by the local ethical committee of Northern Jutland (approval no: N-20160021). All the subjects participated voluntarily and provided written informed consent prior to the experimental procedures.

Wearable EMG Sensors
The data were recorded using the commercial MYO Armband (MYB). The MYB is a wearable EMG sensor developed by Thalmic Labs (Kitchener, ON, Canada, https://www.myo.com/) and has eight channels of dry electrodes with a sampling frequency of 200 Hz. It is a low-cost, consumer-grade device with a nine-axis inertial measurement unit (IMU) [26] that can communicate wirelessly with PCs via Bluetooth. It is non-invasive, more user-friendly, and less time-consuming to set up than conventional electrodes [49,50]. Notwithstanding the low sampling frequency, its performance has been shown to be similar to that of full-band EMG recordings using conventional electrodes [51,52], and the technology has been used in many studies [53][54][55][56][57][58][59][60][61][62][63][64]. Therefore, this study did not focus on the comparison with conventional EMG electrodes.

Experimental Procedure
The MYB was worn over the forearm and placed approximately three centimeters distal to the elbow crease and the olecranon process of the ulna, where it covered the surface of the extensor carpi radialis, extensor digitorum, extensor carpi ulnaris, flexor carpi radialis, palmaris longus, and flexor digitorum superficialis muscles, as shown in Figure 1. The protocol was designed such that each subject performed seven movements with 10 repetitions per movement in a single session; each movement was shown to the subject using a custom-made graphical user interface (GUI) and lasted for 6 s. The data were recorded for 15 consecutive days, and two sessions were recorded per day with a break of one hour in between. Markers were placed to ensure the correct MYB placement in each session on consecutive days. The hand movements included close hand (CH), open hand (OH), wrist flexion (WF), wrist extension (WE), pronation (PRO), supination (SUP), and rest (RT), as shown in Figure 1. Each movement was repeated with a contraction and a relaxation period of 4 s each. The sequence of the movements was randomized for each session.

Signal Processing
The data were filtered with a third-order Butterworth high-pass filter with a cut-off frequency of 2 Hz to reduce movement artifacts. Overlapping windows of 150 ms with a step of 25 ms were extracted. For CNN- and SSAE-r-based classification, the raw EMG samples were directly used as inputs (size = 30 × 8), while for SSAE-f- and LDA-based classification, four time-domain features (TDFs) were extracted (size = 4 × 8): mean absolute value (MAV), waveform length (WL), slope sign change (SSC), and zero crossing (ZC). These TDFs and this particular machine learning algorithm (LDA) were chosen because most MYB-based studies [51,60] applied this combination, and recent studies [59,65] have shown that LDA achieved the highest accuracies with these TDFs as compared with ANN, SVM, KNN, RF, and naïve Bayes (NB). Furthermore, LDA is now also being used with a commercial prosthetic hand control system, COAPT [66] (Coapt, Chicago, IL, USA, https://www.coaptengineering.com). A zero threshold value was used for the SSC and ZC features [67].
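As an illustration of the windowing and feature extraction described above, the following Python sketch segments a recording into 150 ms windows with a 25 ms step and computes the four TDFs with a zero threshold. It is not the code used in this study: the function names (`sliding_windows`, `td_features`), the use of NumPy, and the random test signal are assumptions made for illustration, and the high-pass filtering step is omitted.

```python
import numpy as np

FS_HZ = 200                 # MYB sampling frequency stated in the text
WIN_MS, STEP_MS = 150, 25   # window length and step stated in the text


def sliding_windows(emg, fs=FS_HZ, win_ms=WIN_MS, step_ms=STEP_MS):
    """Split a (samples, channels) recording into overlapping windows."""
    win = int(round(win_ms * fs / 1000))    # 150 ms at 200 Hz -> 30 samples
    step = int(round(step_ms * fs / 1000))  # 25 ms at 200 Hz -> 5 samples
    starts = range(0, emg.shape[0] - win + 1, step)
    return np.stack([emg[s:s + win] for s in starts])


def td_features(window):
    """Return a (4, channels) array with MAV, WL, SSC, and ZC per channel."""
    diff = np.diff(window, axis=0)
    mav = np.mean(np.abs(window), axis=0)
    wl = np.sum(np.abs(diff), axis=0)
    # Slope sign change: two consecutive differences have opposite signs.
    ssc = np.sum(diff[:-1] * diff[1:] < 0, axis=0)
    # Zero crossing: two consecutive samples have opposite signs.
    zc = np.sum(window[:-1] * window[1:] < 0, axis=0)
    return np.vstack([mav, wl, ssc, zc])


# Illustrative input only: 6 s of random data standing in for 8-channel EMG.
rng = np.random.default_rng(0)
emg = rng.standard_normal((6 * FS_HZ, 8))
windows = sliding_windows(emg)
features = td_features(windows[0])
print(windows.shape, features.shape)  # (235, 30, 8) (4, 8)
```

At 200 Hz, a 6 s recording (1200 samples) yields (1200 − 30)/5 + 1 = 235 windows of size 30 × 8, and each window maps to a 4 × 8 feature matrix.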
The details of SSAE and CNN are discussed in Sections 2.5 and 2.6, while LDA was used from a publicly available EMG processing library (MECLAB) [22]. In order to quantify the short-term and long-term performance of the classifiers, both within-day and between-days analyses were performed for each subject. The results are presented as means across all subjects.
The within-day analysis comprised a within-session analysis with 10-fold cross-validation (because there were 10 repetitions per session) and a between-sessions analysis with two-fold cross-validation (because there were two sessions per day). The between-days analysis comprised two-fold cross-validation between pairs of days and 15-fold cross-validation (k = 15 days) in a leave-one-day-out fashion. The classification error (CE), defined as the number of samples wrongly classified divided by the total number of samples, was used as a performance metric. This is related to the classification accuracy (CA) as CA = 1 − CE. Both metrics are widely used to quantify performance in offline myoelectric control studies.
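The definition of CE and its relation to CA can be written out directly. The short Python sketch below is illustrative only; the function name `classification_error` and the toy label vectors are hypothetical and do not come from the recorded data.

```python
import numpy as np


def classification_error(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


# Toy labels for the seven classes (0..6); one of ten samples is misclassified.
y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]
y_pred = [0, 1, 2, 3, 4, 5, 6, 0, 2, 2]

ce = classification_error(y_true, y_pred)
ca = 1.0 - ce
print(ce, ca)  # 0.1 0.9
```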

Autoencoders
Autoencoders are unsupervised deep networks [68] in which input signals are encoded to a new representation and reconstructed at the output via decoders. The error between the original input and the reconstructed input is minimized under constraints such as L2 regularization (L2R), sparsity proportion (SP), and sparsity regularization (SR), and hence the new representation of the data (data-driven features) is optimized.
In this work, a two-layer SSAE was used as previously described [46,69,70] and as shown in Figure 2. The network was trained with the scaled conjugate gradient descent algorithm [71] using greedy layer-wise training [72]. The parameters were optimized as previously detailed [46], and the sizes of the layers were adjusted for both SSAE-f and SSAE-r. For SSAE-f, the sizes of the first and the second layers were 32 (k = n) and 16 units, respectively, while for SSAE-r, they were 100 (k > n) and 50 units, respectively; any further increase in the layer sizes (n, m) only increased the computational cost. The final parameter values for both layers were set as follows: L2 regularization (L2R) was set to 0.0001, sparsity regularization (SR) to 0.01, and sparsity proportion (SP) to 0.5.
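To make the roles of L2R, SR, and SP concrete, the following Python sketch evaluates a commonly used sparse autoencoder cost for a single layer: reconstruction error plus an L2 weight penalty plus a Kullback-Leibler sparsity penalty. The exact cost function used in this study is not stated here, so this formulation, the sigmoid encoder with a linear decoder, and all names (`sparse_ae_cost`, `w_enc`, `w_dec`, and so on) are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sparse_ae_cost(x, w_enc, b_enc, w_dec, b_dec, l2r=1e-4, sr=0.01, sp=0.5):
    """Assumed cost: MSE + l2r * weight penalty + sr * KL sparsity penalty.

    x has shape (n_samples, n_inputs); sp is the target mean activation
    of each hidden unit.
    """
    h = sigmoid(x @ w_enc + b_enc)   # non-linear encoder
    x_hat = h @ w_dec + b_dec        # linear decoder
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l2 = 0.5 * (np.sum(w_enc ** 2) + np.sum(w_dec ** 2))
    # Mean activation per hidden unit, clipped to keep the logarithms finite.
    rho_hat = np.clip(h.mean(axis=0), 1e-8, 1.0 - 1e-8)
    kl = np.sum(sp * np.log(sp / rho_hat)
                + (1.0 - sp) * np.log((1.0 - sp) / (1.0 - rho_hat)))
    return mse + l2r * l2 + sr * kl


# First SSAE-f layer sizes from the text: 32 inputs (4 features x 8 channels)
# and 32 hidden units. The data and weights below are random placeholders.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))
w_enc = 0.1 * rng.standard_normal((32, 32))
w_dec = 0.1 * rng.standard_normal((32, 32))
cost = sparse_ae_cost(x, w_enc, np.zeros(32), w_dec, np.zeros(32))
print(cost)
```

In greedy layer-wise training, a cost of this kind would be minimized for the first layer, and the resulting hidden activations would then serve as the input for training the second layer.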

Convolutional Neural Networks
The CNN is an important architecture of deep learning that is a modification of ordinary neural networks for processing multiple arrays of data, such as images, signals, and language. They work on three simple ideas, including local receptive field, shared weights, and pooling [68]. In the convolutional layer, filters are convolved with patches of input (receptive field) such that an individual filter shares the same learning weights for all patches. The dot product of filters with patches is passed through the activation unit, and the size of output is reduced via pooling.
In this work, a single-layer CNN architecture (as shown in Figure 3) was implemented using the Neural Network Toolbox in MATLAB 2017a. The input corresponds to 150 ms (30 × 8 samples) of raw bipolar EMG data from eight channels. The network comprises a convolutional layer with 32 filters of size 3 × 3, a ReLU layer, a max pooling layer of size 3 × 1, a fully connected layer, and a softmax classification layer. After several trials, the weights of the network were initialized randomly [20]. The parameters were identified with manual hyperparameter tuning [21,73], using the datasets of two randomly selected subjects. Finally, the network was trained with stochastic gradient descent with momentum, and after several trials, the parameters were set as follows: a learning rate of 0.1, L2 regularization of 0.001, momentum of 0.95, a batch size of 256, and a maximum of 25 epochs.
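The data flow through the layers listed above can be traced with a small NumPy forward pass. This is a sketch under stated assumptions, not the MATLAB implementation used in the study: the text does not specify padding or strides, so the example assumes no padding, a convolution stride of 1, and non-overlapping pooling along the time axis. All function names and the random weights are hypothetical.

```python
import numpy as np


def conv2d_valid(x, filters, bias):
    """Cross-correlate x (H, W) with filters (F, kh, kw); no padding, stride 1."""
    f, kh, kw = filters.shape
    h_out, w_out = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((f, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i:i + kh, j:j + kw]
            # Each filter shares the same weights across all patches.
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
    return out


def max_pool_time(x, size=3):
    """Non-overlapping max pooling along the time axis of x (F, H, W)."""
    f, h, w = x.shape
    h_out = h // size
    return x[:, :h_out * size, :].reshape(f, h_out, size, w).max(axis=2)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(window, params):
    a = np.maximum(conv2d_valid(window, params["filters"], params["conv_bias"]), 0.0)
    p = max_pool_time(a, 3)                              # (32, 9, 6) here
    logits = params["fc_w"] @ p.ravel() + params["fc_b"]
    return softmax(logits)


rng = np.random.default_rng(0)
params = {
    "filters": 0.1 * rng.standard_normal((32, 3, 3)),    # 32 filters of 3 x 3
    "conv_bias": np.zeros(32),
    "fc_w": 0.01 * rng.standard_normal((7, 32 * 9 * 6)), # 7 movement classes
    "fc_b": np.zeros(7),
}
probs = forward(rng.standard_normal((30, 8)), params)    # one 30 x 8 window
print(probs.shape)  # (7,)
```

Under these assumptions, a 30 × 8 window becomes 32 feature maps of size 28 × 6 after convolution and 9 × 6 after pooling, giving 1728 inputs to the fully connected layer. Different padding or stride settings would change these sizes.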

Statistical Tests
In order to compare the performance of classifiers for all analyses, statistical tests were performed with two-way analysis of variance (ANOVA) using as factors the classifiers and number of days/sessions. A post hoc multiple comparison analysis test [74] was used to compare the performance of individual classifiers. P values less than 0.05 were considered significant.

Results
The within-day analyses are presented in Sections 3.1 and 3.2, while the between-days analyses are presented in Sections 3.3 and 3.4. The results are presented as mean classification errors (CE) with standard deviation (SD) between subjects. Figure 4 shows the raw EMG data of a randomly selected session for one repetition of each movement along with a rest period, which were directly fed (without any transformation) as inputs to the CNN.

Within-Session Analysis
In this analysis, the CEs of a single session were calculated with 10-fold cross-validation, and they were averaged over the 30 sessions of each individual subject. The final results are presented as the mean of all the subjects (Figure 5). There was no significant difference between SSAE-f and CNN (p = 0.55), while both classifiers significantly outperformed (p < 0.001) the other two classifiers (LDA and SSAE-r). The large standard deviation (SD) with the classical machine learning technique reveals large differences in the mean accuracies of the individual subjects. However, this difference was significantly reduced with the proposed techniques (SSAE-f and CNN).

Between-Sessions Analysis
In this analysis, two-fold cross-validation was used between sessions completed on the same day. Hence, for the individual subjects, the results of 15 days were averaged, and the final results are presented as the mean CE of all seven subjects, as shown in Figure 6. No significant difference was found between SSAE-f and CNN (p = 0.538), while both classifiers performed significantly better (p < 0.001) than LDA and SSAE-r (worst).

Analysis Between Pairs of Days
For each individual subject, the data of 15 days were organized into 105 unique pairs of days. For each pair of days, two-fold cross-validation was used, and the results of all seven subjects were averaged and are tabulated in Tables 1 and 2. In both tables, the individual cells represent the mean CE of all the subjects for the corresponding pair of days. Table 1 presents the results with data-driven features, where the upper triangle (above the diagonal) is for CNN and the lower triangle for SSAE-r. Table 2 presents the results with handcrafted features, where the upper triangle is for SSAE-f and the lower triangle for LDA. For each individual subject, the mean CE of this analysis with each classifier is also presented in Table 3.
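The count of 105 follows from choosing 2 days out of 15 without regard to order (15 × 14 / 2). The Python sketch below enumerates the pairs and shows one plausible reading of two-fold cross-validation for a pair, namely training on one day and testing on the other, then swapping and averaging. The `evaluate` callable and the toy error model are hypothetical placeholders, not study results.

```python
from itertools import combinations

N_DAYS = 15

# All unordered pairs of distinct days: 15 * 14 / 2 = 105.
day_pairs = list(combinations(range(1, N_DAYS + 1), 2))


def pair_error(evaluate, day_a, day_b):
    """Two-fold CE for one pair of days (assumed: swap roles and average).

    `evaluate(train_day, test_day)` is a hypothetical callable that trains a
    classifier on one day and returns its CE on the other.
    """
    return 0.5 * (evaluate(day_a, day_b) + evaluate(day_b, day_a))


def toy_evaluate(train_day, test_day):
    # Made-up stand-in, not study data: CE grows with the gap between days.
    return 0.01 * abs(train_day - test_day)


errors = {pair: pair_error(toy_evaluate, *pair) for pair in day_pairs}
print(len(day_pairs))  # 105
```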

Leave-One-Out between Days (15-Fold Cross-Validation)
In this analysis, a 15-fold cross-validation scheme was used, such that each day constituted a separate fold. The results are presented as the mean CE of all seven subjects, as shown in Figure 7. Although CNN achieved a comparatively lower error rate than did SSAE-f, there was no significant difference (p = 0.219) between the two, and both performed significantly better (p < 0.001) than SSAE-r and LDA. In both between-days analyses, CNN achieved the lowest absolute CE. Table 3 summarizes the results for each individual subject as the mean CE over 15 days achieved with each classifier in all four analyses.
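A leave-one-day-out split of this kind can be sketched as follows: each fold holds out every window recorded on one day for testing and trains on the windows from the remaining 14 days. The generator name `leave_one_day_out` and the toy day labels are assumptions for illustration only.

```python
import numpy as np


def leave_one_day_out(day_of_window):
    """Yield (day, train_idx, test_idx), holding out one recording day per fold."""
    day_of_window = np.asarray(day_of_window)
    for day in np.unique(day_of_window):
        test_idx = np.flatnonzero(day_of_window == day)
        train_idx = np.flatnonzero(day_of_window != day)
        yield int(day), train_idx, test_idx


# Toy example: 15 days with 4 windows each (real sessions contain many more).
days = np.repeat(np.arange(1, 16), 4)
folds = list(leave_one_day_out(days))
print(len(folds))  # 15
```

Splitting by day rather than by random window keeps all data from the test day out of training, which is what makes the fold a measure of between-days robustness.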

Computational Time
The classifiers were trained on a system with an NVIDIA Quadro K620 GPU (NVIDIA, Santa Clara, CA, USA), a 2.40 GHz processor, and 256 GB of RAM. Training and testing with CNN took 13.10, 15.63, 32.43, and 467 s for the within-session, between-sessions, pairs-of-days, and leave-one-day-out analyses, respectively, whereas SSAE-f took 23.44, 26.13, 48.42, and 607.95 s, respectively. SSAE-f achieved higher accuracy than did CNN only in the within-session analysis, while CNN achieved higher accuracies in the other three analyses. Hence, CNN proved to be more robust and computationally efficient than SSAE-f.

Discussion
We evaluated deep learning techniques (both with EMG features and bipolar samples as inputs) to explore their performances in long-term EMG classification. The main finding is that bipolar EMG is suitable for deep learning when applied to CNN architecture. The performance was not good when applied with SSAE despite the size of the layers. For SSAE, handcrafted features are necessary in similar degrees as classical machine learning, such as LDA.
The results of the different analyses reveal that deep learning methods (CNN and SSAE-f) not only outperformed the state-of-the-art LDA in the short term but also showed improved performance over multiple days. Hence, the results were consistent with the notion that deep network performance improves with increasing training size [75]. We hypothesize that with more days (>15), the performance of the leave-one-out analysis will converge towards the within-session performance.
This study explored both handcrafted and data-driven features-based techniques. Based on handcrafted features, classical machine learning algorithms have been widely explored for EMG-based movement classification, and the LDA has emerged as the reference classifier [16,22,76] for commercial systems [66] (https://www.coaptengineering.com). However, in this study, performance measured by CE was poorer with LDA than with autoencoders even in within-session analysis. Moreover, the LDA performance worsened over days. Based on data-driven features, CNN showed promising results as compared with the autoencoders. Although autoencoding generalized well with handcrafted features, it failed to generalize with raw data.
Results across studies are comparable when the number of classes is similar [21] and the same hardware has been used for recording EMG. Based on the wearable MYB, several studies attempted to classify different wrist movements using either classification error or accuracy as the performance metric. Mendez et al. [51] classified nine hand movements using the same TDF with LDA and achieved a mean accuracy of 91.67 ± 6.89. Wahid et al. [59] classified three wrist movements using the same TDF with LDA and achieved a mean accuracy of 94.45 ± 5.20 for between-subjects analysis. Masson et al. [61] classified five wrist movements with KNN and achieved an accuracy of 90%. Allard et al. [19] proposed a CNN that used spectrograms of EMG recorded in a single session as the input and achieved an offline accuracy of 97.8% for seven wrist movements. Our proposed methods (CNN and SSAE-f) achieved mean accuracies of 97.60 ± 1.99 and 98.12 ± 1.07, respectively, in the within-session analysis and also showed improved performance over days. The proposed CNN model used bipolar EMG, unlike that of Allard et al. [19], which used a transformed version (spectrogram) of the raw EMG.
The performance of deep learning methods was dependent on the network architecture and optimal parameter selection. For SSAE, the number of units in both layers was optimized up to the point where a further increase in units did not significantly improve performance. Similarly, performance was best with non-linear activation functions for the encoders and linear activation functions for the decoders. Errors at layers 1 and 2 were dependent on SR and L2R, respectively. For CNN, increased size and number of filters were tested during pilot analyses with no significant improvement, most probably because of the limited sample size and the low complexity of the classification problem. Similarly, the number of epochs was also varied in the preliminary analyses. Some manually tuned parameters, including initial learning rate and momentum, played a significant role. However, varying L2R did not affect the performance. Overall, there was no significant difference between CNN and SSAE-f. However, CNN with bipolar EMG as the input achieved higher accuracies (in three out of four analyses) than did the methods based on handcrafted features (LDA and SSAE-f). This study presents a preliminary step towards reducing the between-days error with increased training size as an effective means of enhancing the long-term robustness of the classification task. However, the study comprised a small number of able-bodied subjects and presented an offline analysis, which limits the possibility of generalizing the results.

Conclusions
This study demonstrated that deep learning techniques outperformed the classical machine learning algorithm using both handcrafted features and raw EMG signals as inputs. The results of intra/intersessions and between-days analyses imply that CNN has the potential to recognize EMG patterns even from raw bipolar EMG data for long-term classification notwithstanding the stochastic nature of the EMG signals. This is important in mitigating the hassle of feature selection or signal transformation prior to classification. Nevertheless, it should be noted that although CNN performed well with raw EMG, SSAE, which is another deep learning architecture, still requires feature extraction for a robust performance.