Open Database for Accurate Upper-Limb Intent Detection Using Electromyography and Reliable Extreme Learning Machines

Surface Electromyography (sEMG) signal processing has disruptive potential to enable a natural human interface with artificial limbs and assistive devices. However, this real-time biosignal control interface still presents several restrictions, such as control limitations due to a lack of reliable signal prediction and a lack of standards for signal processing among research groups. Our paper aims to present and validate our sEMG database through signal classification performed by the reliable forms of our Extreme Learning Machine (ELM) classifiers, which are used to maintain a more consistent signal classification. To perform the signal processing, we explore the use of a stochastic filter based on the Antonyan Vardan Transform (AVT) in combination with two variations of our reliable classifiers (denoted R-ELM and R-Regularized ELM (R-RELM), respectively) to derive a reliability metric from the system, which autonomously selects the most reliable samples for the signal classification. To validate and compare our database and classifiers with related papers, we performed the classification of the whole of Databases 1, 2, and 6 (DB1, DB2, and DB6) of the NINAPro database. Our database presented consistent results, while the reliable forms of the ELM classifiers matched or outperformed related papers, reaching average accuracies higher than 99% for the IEE database, while average accuracies of 75.1%, 79.77%, and 69.83% were achieved for NINAPro DB1, DB2, and DB6, respectively.


Introduction
Interfaces based on biosignals, supported by recent developments in areas such as medicine, engineering, computer science, and microelectronics, are becoming increasingly popular. Recently, surface Electromyography (sEMG) and Electroencephalography (EEG) signals have been used to offer control of assistive devices for people with some level of impairment, amputation, or specific movement restriction [1][2][3]. Despite recent advances in sEMG signal classification for the activation of auxiliary devices [3][4][5], the development of optimal signal processing strategies and portable devices still faces several restrictions. Although the sEMG signal has a deterministic range in frequency and amplitude, factors such as subject dependency and lack of signal repeatability have often precluded efficient and reliable myoelectric pattern recognition and control since the first studies in the area [6,7], making optimal sEMG signal classification an arduous task from a machine learning perspective [8][9][10][11][12]. Thus, the natural control of assistive devices based on sEMG activation is a field of constant expansion in biomedical engineering.
Usually, open-access sEMG databases have some restrictions concerning the small number of movements performed, the small number of subjects, the lack of assay repetitions, and the variations in

Experimental Protocol
The E2 NINAPro database exercise [13] inspired the 17 hand and wrist movements that form the IEE database. The 17 different movements performed interspersed by a rest class are presented in Figure 1 and are composed of three main groups. Group 1 consisted of finger movements (thumb up; extension of middle and index finger with flexion of other fingers; flexion of little and ring fingers and extension of the others; thumb flexion; abduction of all fingers; hand closure; pointing index and abduction of extended fingers). Group 2 gathered the torsion movements (wrist supination, axis: middle finger; wrist pronation, axis: middle finger; wrist supination, axis: little finger; and wrist pronation, axis: little finger). Group 3 was composed of wrist movements (wrist flexion; wrist extension; wrist radial deviation; wrist ulnar deviation; and wrist extension with the hand closed).
The characterization of subjects is pertinent to appraise the results since skinnier and younger individuals, for example, tend to reach better accuracy results, as demonstrated by Atzori et al. [14]. During the data acquisition, each one of the four untrained subjects was requested to sit comfortably on a chair positioned in front of an LCD monitor and to reproduce the movements displayed as naturally as possible according to the video stimulation. For each one of the four different assays, the four subjects were requested to repeat the trials three times, forming a subset of the database with 12 trials for each assay. The assays differed from each other according to the number of repetitions and the order of movements. Assays A and B had, respectively, six and ten repetitions of movements in sequential order (as in the NINAPro databases DB2 and DB1, respectively). Assays C and D had the same numbers of repetitions, performed in random order. The electrode positioning was the same as proposed in the NINAPro database, as presented in Figure 2.
All the sEMG signal acquisition was performed by 12 channels formed by 24 disposable surface electrodes and a reference electrode placed on the individual's forehead. The channels were connected to a battery-powered commercial sEMG device (EMG 830 C, from EMG System do Brasil). The signal was digitized at a 2-kHz sampling frequency with 18 bits of quantization by an NI USB-6289 platform from National Instruments. The acquisition was performed using a notebook running LabVIEW 9 in a Windows 10 environment.
All procedures performed followed the ethical standards of the 1964 Helsinki declaration and its later amendments or comparable ethical standards and were approved by the institutional research committee under the Certificate of Presentation for Ethical Appreciation (Number 11253312.8.0000.5347). Before the experiment, each subject was requested to give informed consent and answer questions regarding clinical data including age, gender, height, weight, and laterality and also had their Forearm Circumference (F.C) and Forearm Length (F.L) measured. Those characteristics are detailed in Table 1.

The LabVIEW Interface
The LabVIEW routines were designed to interact with the hardware and also provided a cognitive walkthrough to the user, presenting to the user the next movement to be performed in a small auxiliary interface window and providing preparation time. The videos were built using the MakeHuman software for the creation of the anthropomorphic arm forms and the Blender software to put them together and time the animations. The LabVIEW routines and the videos used are available for download on our website along with the MATLAB functions and syntax used to transform the LabVIEW ".lvm" files into ".m" MATLAB files. By making this material available, we hope to enable different groups to recreate the experimental conditions and formulate their own databases, as well.

Data Relabeling
The process of labeling the samples was performed based on the timestamps generated by each movement repetition displayed as the subject stimulation. However, commonly, some delays caused by the volunteer's reaction time tended to create mismatches and frequently incorrect labels for some samples at the beginning and end of each movement. To improve the signal labeling, an algorithm based on [13], a signal filtering, and a Generalized Likelihood Ratio (GLR) model were used for relabeling.
To define the most appropriate label in the rest-movement-rest transitions, an exhaustive search over all possible values for the movement beginning and ending times (denoted t_0 and t_1, respectively) was performed. First, t_0 was fixed at a value based on the initial timestamps, while t_1 was incremented until the end of the window. After this first iteration, t_0 was incremented, and t_1 went through a new sweep. In each iteration, t_0 was incremented by 30 ms, and the Generalized Likelihood Ratio (GLR) between the rest-movement-rest sections was calculated. Subsequently, the combination of t_0 and t_1 with the highest GLR value was selected, which defined the movement beginning and end. The relabeling method, performed in MATLAB, is also available for download on our website.
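The exhaustive search described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the statistical model behind the GLR, so the sketch assumes zero-mean Gaussian segments with independent variances, compares a three-segment (rest-movement-rest) model against a single-segment model, and advances t_1 with the same step as t_0. The names `segment_loglik` and `relabel` are hypothetical and are not taken from the authors' MATLAB code.

```python
import math

def segment_loglik(x):
    """Gaussian log-likelihood of zero-mean samples at their ML variance (assumed model)."""
    n = len(x)
    var = max(sum(v * v for v in x) / n, 1e-12)  # ML variance, floored to avoid log(0)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def relabel(x, step, min_len):
    """Exhaustive search for movement onset t0 and offset t1 (sample indices).

    Every candidate pair splits the window into rest / movement / rest and is
    scored against a single-segment model; the pair with the highest
    log-likelihood ratio is returned together with its score.
    """
    n = len(x)
    base = segment_loglik(x)
    best = (None, None, -math.inf)
    for t0 in range(min_len, n - 2 * min_len + 1, step):
        for t1 in range(t0 + min_len, n - min_len + 1, step):
            glr = (segment_loglik(x[:t0]) + segment_loglik(x[t0:t1])
                   + segment_loglik(x[t1:]) - base)
            if glr > best[2]:
                best = (t0, t1, glr)
    return best

# Synthetic window: low-amplitude rest, high-amplitude movement, rest again.
rest_a = [0.01 * (-1) ** i for i in range(120)]
move = [1.0 * (-1) ** i for i in range(180)]
rest_b = [0.01 * (-1) ** i for i in range(120)]
# 30 ms at the 2-kHz sampling rate corresponds to a step of 60 samples.
t0, t1, score = relabel(rest_a + move + rest_b, step=60, min_len=60)
```

With a 30-ms step, the onset and offset can only be resolved to the nearest grid point, so the step trades search cost against labeling precision.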

Signal Filtering
Since noise outside the band of interest of the sEMG signal was usually removed by the analog conditioning in the signal acquisition stage, digital filtering often aimed to remove artifacts within the sEMG bandwidth through adaptive filters [18][19][20][21][22][23]. To provide a straightforward and time-efficient alternative for sEMG signal filtering, we developed the AVT filter. The AVT filter used in our database was designed to remove noise within the band of interest and to provide a smoother and more regular source for feature extraction, which enhanced the pattern recognition capacity of the machine learning method. The original AVT algorithm was designed using sample discard; however, we modified it to avoid discarding samples before the classifier, as presented in Figure 3. The AVT filter, as designed, processed each 200-ms segment (sg) extracted from the sEMG signal. Once a segment was processed, the segmentation window slid 10 ms forward in the signal, characterizing a sliding-window approach. Considering the Mean Signal Amplitude (MSA), the overall process was similar to a moving average filter; the main difference resides in the selectivity provided by the Mean Signal Deviation (MSD), given by the standard deviation of the signal (σ), rather than a hard threshold based on the MSA alone. After the definition of the range of excursion (MSA ± MSD), the only samples (s) altered were those outside its boundaries, which were replaced by the MSA value, while the remaining samples stayed intact. The threshold was also influenced by two weighting filter factors (ff_1 and ff_2).
The filter factors were used to provide a low-pass behavior for the 190-ms (95% of the signal) portion of the segment and to highlight the dynamics of the 10-ms (5% of the signal) incoming portion for each sliding window. For that, the values of ff_1 and ff_2 were defined as 0.8 and 0.2, respectively. The effect of the designed AVT filter on the sEMG signal is presented in Figure 4a,b, as well as its effect on the RMS feature in Figure 4c,d. The Gaussian response derived from the signal was a direct consequence of the incremental activation profile of the Motor Units' Action Potentials (MUAPs), which led to proportional EMG responses. More detailed information about the AVT filter and its comparison with other filtering and non-filtering scenarios was explored in [16].
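The core replacement rule of the filter (samples outside MSA ± MSD are set to the MSA, all others are kept) can be illustrated with a minimal sketch. It covers a single segment only and deliberately omits the ff_1/ff_2 weighting, since the text does not state how the factors enter the threshold. The function name `avt_segment` and the use of the population standard deviation are assumptions.

```python
import statistics

def avt_segment(segment):
    """Replace samples outside mean +/- standard deviation with the segment mean.

    Assumption: MSD is the population standard deviation of the segment;
    the ff_1/ff_2 weighting described in the text is not modeled here.
    """
    msa = statistics.fmean(segment)    # Mean Signal Amplitude
    msd = statistics.pstdev(segment)   # Mean Signal Deviation (sigma)
    lo, hi = msa - msd, msa + msd
    return [s if lo <= s <= hi else msa for s in segment]

# Toy segment with one spike: mean = 3.0, sigma = 4.0, range = [-1.0, 7.0].
filtered = avt_segment([1.0, 1.0, 1.0, 1.0, 11.0])
```

In this toy segment only the spike falls outside the range of excursion, so it is the only sample replaced; no sample is discarded and the segment length is preserved.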

Signal Segmentation and Feature Extraction
Generally, the definition of segment lengths and the selection of features is not a well-defined field in sEMG signal processing, with approaches varying considerably between those two factors. Despite the representativity of the sEMG signal generally being proportional to the segment length used [24], previous tests with the NINAPro database [5,12] observed that variations of 100 ms, 200 ms, and 400 ms of signal length did not offer a significant statistical difference for the results. Moreover, the use of more sophisticated frequency or time-frequency features also did not present a clear accuracy improvement compared to more simplistic signal representations for offline classification [12,25]. Furthermore, according to [26], there is no guarantee that additional individual efficient features of a model would offer a more efficient combination for signal classification considering that the systems tend to overfit. Previous studies [13,27] also concluded that due to the sEMG nature, a non-linear kernel is even more efficient for the signal classification than a specific set of features.
Although higher accuracy rates are obtained using bigger sEMG portions (windows), the process of signal buffering results in delays in the classification response that usually preclude the real-time control of assistive devices [24]. In our paper, the overlapped windows approach was chosen for the signal segmentation to maintain a balance between reasonable signal representation and the system responsiveness. Additionally, similar papers [12][13][14][15]28] were also considered to enable a fair comparison of results. Thus, overlapped windows of 200 ms of length and 10 ms of increment were used to segment the signal and extract the classical time-domain features: RMS, Variance (VAR), Mean Absolute Value (MAV), and Standard Deviation (SD), which are commonly used in sEMG signal classification [7,10,29], for each one of the 12 channels. This simple set of features for the signal representation was chosen based on previous related NINAPro works and also to highlight the consistency of our database. This approach also highlights the potential of our reliable classifiers, which can match or even outperform related results in the literature even when not using longer sEMG segments or more complicated signal representations.
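The segmentation and feature extraction described above can be sketched as follows, assuming the 2-kHz sampling rate from the acquisition setup, so that a 200-ms window holds 400 samples and a 10-ms increment is 20 samples. The function names are hypothetical, and the variance is computed as the sample variance around the window mean, which is one of several conventions in the sEMG literature.

```python
import math

FS_HZ = 2000    # sampling rate stated in the acquisition section
WIN_MS = 200    # window length
STEP_MS = 10    # window increment

def sliding_windows(signal, fs=FS_HZ, win_ms=WIN_MS, step_ms=STEP_MS):
    """Yield overlapped windows of one channel."""
    win = fs * win_ms // 1000     # 400 samples at 2 kHz
    step = fs * step_ms // 1000   # 20 samples at 2 kHz
    for start in range(0, len(signal) - win + 1, step):
        yield signal[start:start + win]

def td_features(window):
    """Return [RMS, VAR, MAV, SD] for one window."""
    n = len(window)
    mean = sum(window) / n
    rms = math.sqrt(sum(v * v for v in window) / n)
    var = sum((v - mean) ** 2 for v in window) / (n - 1)  # sample variance (assumed)
    mav = sum(abs(v) for v in window) / n
    return [rms, var, mav, math.sqrt(var)]

def extract_features(channels):
    """channels: list of equal-length sample lists, one per sEMG channel.

    Returns one feature vector per window, concatenating the four features
    of every channel (4 x 12 = 48 values for the 12-channel setup).
    """
    per_channel = [[td_features(w) for w in sliding_windows(ch)] for ch in channels]
    # zip(*per_channel) groups the k-th window of every channel together
    return [[f for feats in row for f in feats] for row in zip(*per_channel)]

# One second of a 12-channel recording yields (2000 - 400) / 20 + 1 = 81 windows.
vectors = extract_features([[0.0] * 2000 for _ in range(12)])
```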

Signal Classification
The signal classification was performed using state-of-the-art classifiers based on Extreme Learning Machines (ELM) in its standard (ELM) and Regularized (RELM) forms [30][31][32]. We derived a reliable version of the classifiers denoted R-ELM and R-RELM, respectively.

ELM
The ELM is a particularly attractive machine learning solution for applications that demand quick model formation. The classifier is formed through a non-iterative method that does not require an optimization process and instead uses a Moore-Penrose pseudoinverse, which allows an optimal model to be achieved within a tolerance for error. By its nature, the method natively avoids some classical problems of more traditional machine learning solutions, such as local minima and sub-optimal solutions [33]. Furthermore, the method has a natural multiclass capacity and a very reasonable computational cost when compared to the reference classifiers in the field, such as SVM [25]. The basic ELM structure is composed of the linear system presented in Equation (1), where H is the input matrix formed by the features projected by a kernel, β is the model to be found, and T is the label matrix. The derivation of H is detailed in Equation (2), where w and b are the random weights and biases of the network neurons, attributed within ranges of [−1; 1] and [−1.5; 1.5], respectively, to maintain the low-pass response of the classifier. The φ represents the Radial Basis Function (RBF) kernel that projects each one of the N features in the L hidden neurons, a kernel acknowledged to be efficient for sEMG signal classification [13,27].
For the ELM, L is the only hyperparameter to be defined. Thus, we performed preliminary tests to relate the L number and accuracies in the training and testing of the models for both ELM and RELM classifiers within a range of 50-1000 hidden neurons. The optimal number of hidden neurons was defined by the maximum accuracy rate achieved for each subject.
For the ELM and RELM, H^† was calculated through the Moore-Penrose pseudoinverse and Tikhonov regularization (with C as the regularization factor), respectively, as presented in Equation (3). With H^† defined, the system can be solved in a very straightforward manner, as presented in Equation (4) [33].
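For reference, the relations described in this subsection can be written in the standard ELM formulation as below. This is the usual textbook form and is offered as an assumption about what Equations (1) to (4) express; the authors' exact notation is not reproduced in this text. Here N is taken as the number of samples and L as the number of hidden neurons, following the Figure 5 caption, and the placement of C in the regularized case follows the common I/C convention.

```latex
% Standard ELM formulation; notation follows the surrounding text.
\begin{align}
H\beta &= T \\
H &= \begin{bmatrix}
\varphi(w_1, b_1, x_1) & \cdots & \varphi(w_L, b_L, x_1) \\
\vdots & \ddots & \vdots \\
\varphi(w_1, b_1, x_N) & \cdots & \varphi(w_L, b_L, x_N)
\end{bmatrix}_{N \times L} \\
H^{\dagger} &= \begin{cases}
\left(H^{\top}H\right)^{-1}H^{\top} & \text{ELM (Moore-Penrose, } H \text{ with full column rank)} \\
\left(\dfrac{I}{C} + H^{\top}H\right)^{-1}H^{\top} & \text{RELM (Tikhonov regularization)}
\end{cases} \\
\beta &= H^{\dagger}T
\end{align}
```

Because β is obtained in a single linear solve, training cost is dominated by forming and inverting an L × L matrix, which is why the method needs no iterative optimization.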
The final label is attributed using an argmax heuristic where the highest output value (T) among all classes takes the label. Two metrics were used to appraise the system in its standard and reliable forms, the overall accuracy, presented in Equation (5), and the weighted accuracy, which ponders the accuracy for each one of the 18 classes (c), presented in Equation (6). The overall system architecture for the sEMG signal classification is presented in Figure 5.
$$\text{overall accuracy (\%)} = \frac{\text{Correct classifications}}{\text{Total samples tested}} \times 100\% \qquad (5)$$

$$\text{weighted accuracy (\%)} = \frac{\text{Correct class } c \text{ classifications}}{\text{Total class } c \text{ classifications}} \times 100\% \qquad (6)$$

Figure 5. sEMG signal classification using a generic ELM classifier and the AVT filter. In this example, N is the number of samples to classify in C classes using d features and L hidden neurons.
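A small sketch shows how the two metrics diverge on an unbalanced label set. It assumes that the weighted accuracy is the unweighted mean of the per-class accuracies of Equation (6), with each class ratio computed over the true samples of that class; the function names are hypothetical.

```python
def overall_accuracy(y_true, y_pred):
    """Sample-to-sample accuracy in percent, as in Equation (5)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def weighted_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies in percent (assumed reading of Equation (6))."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    per_class = [100.0 * hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)

# Eight rest samples (class 0) and two movement samples (class 1),
# with one movement sample misclassified as rest.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
overall = overall_accuracy(y_true, y_pred)
weighted = weighted_accuracy(y_true, y_pred)
```

In this toy case the dominant rest class lifts the overall accuracy to 90%, while the weighted accuracy is 75% because the movement class is only half correct.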

Reliable Signal Classification
To define the adequate class to label a given input, the ELM method relies on the argmax heuristic. Once the model processes the input, the output vector holds one value per class, which is treated as the likelihood of the sample belonging to that class. The output class label for each sample is then attributed based on the highest output value (referred to here as the argmax value), which in an ideal scenario is far higher than the values of the remaining classes, forming a reliable classification. Using this inherent mechanism of the ELM classifier to attribute labels, and based on the interval range calculated by Equation (7), a threshold (th) value was designed to identify the non-reliable classifications, which are instantly ignored by our classifier. Thus, the reliable version of the classifier can maintain a more coherent and robust classification and autonomously discard outliers and poorly fitted data. An example of the variation of the maximum argmax value for each classification performed is presented in Figure 6. The movement transitions are well known for their lack of signal representativity and for class overlap from a machine learning perspective. Those factors make the class distinction in these sections particularly challenging, precluding an ideal class separation [34], and result in lower argmax values, which lead to lower reliability of the classification for those periods. The same situation occurs in the classification ripples in the intermediate portion of the signal, which provoke an erroneous classification each time the classifier fails to reach an appropriate argmax value and adequate class separation. The solid black line presents the ideal classification in Figure 6, while the solid red line represents the predicted class output. The mismatches (errors) in the signal classification tended to occur at the drops in classification reliability, which is represented by the dashed blue line.
$$th_{reliability} = \mu_{reliability} - \sigma_{reliability} \qquad (7)$$

The value of the threshold (th_reliability) was derived from the average reliability (µ_reliability), taken as the maximum argmax value (max(argmax)) of the signal, considering its standard deviation (σ_reliability). Subtracting the standard deviation provides a relaxation factor for the classifier, given that some classifications performed with a value slightly lower than µ_reliability are often correct. Our heuristic takes as its premise that, if a representative dataset trained the classifier, a reliable test sample must provide patterns that fit the trained model well enough to offer consistently higher argmax values in comparison to the remaining classes. In an ideal scenario, the correct class is related to an argmax output value that is higher than the average, while the remaining classes achieve considerably lower values, characterizing an adequate class separation. This classification, when it occurs, characterizes a reliable classification. For non-satisfactory values of the argmax, we ignored the classification performed, since it is better not to perform any action than to provide an erroneous action that may harm the user or damage the surrounding environment. At the same time, this method provides criteria for data discarding and can be improved in the future to decide when the model needs to be retrained and which classes are in fact capable of being learned by the classifier. The reliable versions of the ELM and RELM classifiers are denoted in our paper as R-ELM and R-RELM, respectively. The regular value of reliability in Figure 6 is perceptible for the rest class, and so is its sudden fall in the movement transition sections of the classifications. The solid red line provides the predicted class, while the blue dashed curve represents the reliability of the system for each classification performed. Classification errors tend to occur when the reliability metric drops, which indicates non-reliable classifications that can be autonomously identified by the classifier.
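The reliability rule of Equation (7) can be sketched as follows. The sketch assumes that the mean and standard deviation are computed over the maximum output values of the batch being classified and that the population standard deviation is used; the text does not state which data the statistics are drawn from. The function name `reliable_predictions` is hypothetical.

```python
import statistics

def reliable_predictions(outputs):
    """Label each sample by argmax, or None when its reliability is below threshold.

    outputs: list of per-class output vectors, one per sample.
    Returns (labels, threshold), where threshold = mean - std of the
    per-sample maximum output values (Equation (7)).
    """
    reliability = [max(o) for o in outputs]
    th = statistics.fmean(reliability) - statistics.pstdev(reliability)
    labels = [o.index(max(o)) if r >= th else None
              for o, r in zip(outputs, reliability)]
    return labels, th

# Three well-separated outputs and one ambiguous output.
outputs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.55, 0.45]]
labels, th = reliable_predictions(outputs)
```

Here the threshold lands near 0.64, so the ambiguous fourth sample (maximum output 0.55) is discarded while the other three keep their argmax labels.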

IEE Database Validation
Ideally, the data should be as representative and distinct as possible to enable the classifier's optimal accuracy. Since several experimental factors such as electrode positioning, subjects' physiology, and fatigue may alter the sEMG signal, we decided to evaluate the consistency of the IEE database. Figure 7 presents the average distribution of sEMG signals' amplitude concerning movement repetitions and the 12 trials performed in each assay. The evaluation considers all the movement repetitions and the 12 trials performed, with Trials 1-3, 4-6, 7-9, and 10-12 performed by Subjects 1, 2, 3, and 4, respectively.
The first analysis showed that different assay types (A, B, C, and D) did not result in a significant difference regarding the average sEMG signal amplitude. However, the movement repetition itself was identified as an influencing factor that resulted in one outlier value. An ANOVA (p = 0.05) was used to validate the regularity of the rectified sEMG signal amplitude using each assay separately. The results confirmed that the movement repetitions were executed differently by the subjects, which made both the repetitions and, ultimately, the trials significant factors. The signal dispersion concerning movement repetition was similar to E2 (Exercise 2) from the NINAPro database, as presented by Atzori et al. [13]. Regarding outliers, one was detected for each movement repetition of Assay A and for Movement Repetitions 1 and 6 of Assay D. That was expected to occur in some assays as a consequence of the physiological differences between the subjects, which were magnified by movement execution, generating distinct EMG activation profiles. Even so, Assays B and C did not present any outliers, considering the average signal amplitude.
Regarding the segmentation time, the related literature cites a trend of the accuracy rate increasing proportionally to the window length. We evaluated this aspect using the data from all repetitions of Assay A of Subject 1 to check the influence of the segment window length on different metrics of the system, as presented in Figure 8 (Figure 8f). Our test evaluated intervals of 100 ms, 200 ms, 300 ms, 400 ms, and 500 ms for the same set of our original TD features. The statistical significance for all tests performed was evaluated through an ANOVA (Tukey test, p = 0.05). In this same test, the influence of ELM vs. RELM was also evaluated as a control variable. For the training accuracy rate, it was found that the segmentation length was statistically significant (most influenced by the 400-ms and 500-ms scenarios), while the classification method itself did not have a significant influence on the results. This suggests that longer windows for feature extraction are more likely to benefit the formation of a more accurate model. The same result was also verified regarding the non-reliable data detected by the classifier; longer segmentation tended to lead the system to narrower ranges at lower plateaus, which seems to reflect a situation of non-overfitting, which would increase the discarding of samples considerably. The overall and weighted accuracy metrics presented a significant response regarding the classification method, with RELM showing higher accuracy rates, but the segment size variation was not significant. The results suggest that although longer segments influence the training accuracy, this does not necessarily translate into higher accuracy rates. Thus, despite offering lower results, the 100-ms segmentation can be applied without any significant implications, at least under the test conditions.
Neither the overall nor the weighted reliable metric presented statistically significant differences with respect to the variation of segment length or classification method. This result suggests that our reliable forms of the classifiers, at least for the conditions tested, were able to mitigate the influence of segmentation length on the accuracy results, proving to be robust alternatives for sEMG signal classification. Table 2 presents the classification results using ELM and RELM models in their standard and reliable forms (denoted R-ELM and R-RELM, respectively). The results in Table 2 are organized by assays, subjects, methods, and two different metrics: the overall accuracy, formed by the sample-to-sample comparison of the ideal and predicted labels, and the weighted accuracy, which considers the weighted average results among all classes. For the reliable versions of the classifiers, the average discard rate of the non-reliable data is also presented. Each result considers the three repetitions performed by each subject in each type of assay (Trials 1-3 for Subject 1, 4-6 for Subject 2, 7-9 for Subject 3, and 10-12 for Subject 4).

Table 2. Mean accuracy rates achieved considering the three repetitions for each assay. The results are divided by assay, subject, method, and accuracy. For the reliable versions of the classifiers, the amount of discarded data is also presented. R-RELM, Reliable Regularized ELM.

A full-factorial design of experiments (p = 0.05 and R² = 85.4%) was conducted to define the factors that significantly affected the accuracy rate. The assay type, both classifiers (in their standard and reliable forms), both accuracy metrics, and the subjects were considered controlled variables, while the accuracies were treated as the response of the experiment.
All the variables excluding the classifiers' individual interaction with assays and subjects and the mutual interaction with metric and assays and metric and subject were significant in the test. These results were coherent with the results presented in Table 2, which demonstrated distinct results and, therefore, the influence of all these factors on the results achieved.

Signal Classification
The overall accuracy achieved rates above 90% in all cases and very close to 100% for several scenarios of the reliable versions of the classifiers. In contrast, the weighted accuracy, which considers all the movement classes to compose the final average, was lower for all tests. The use of both metrics was pertinent given that the overall accuracy is generally used in papers in the area, and the weighted accuracy presented accuracy without the bias caused by the rest class. There was a visible difference in the weighted accuracy among all the subjects, but a reasonable coherence in the rates considering the sequential (A and B) and random (C and D) assays performed in the baseline and the reliable form of the classifiers. The number of repetitions used proved to be insignificant concerning the use of four (Assays A and C) or six (Assays B and D) movement instances to train the classifier.

NINAPro Databases' Classification
To validate and compare our classifiers with related papers, we performed the classification of the whole of Databases 1, 2, and 6 (DB1, DB2, and DB6) of the NINAPro database, formed by its three different "exercises" (E1, E2, and E3) comprising 50 different upper-limb movements (DB1, DB2) and 18 hand and force movements with assay repetitions (DB6). The average results achieved for each database are presented in Table 3. For the sake of comparison, we chose not to detail each database result, but to use an average of the results for exercises E1, E2, and E3, as presented in the related papers. DB6 was tested under two different conditions. The first Condition (CD1) consisted of intra-session signal classification, while the second Condition (CD2) used data from different assays to train and test the classifier, as presented by Palermo et al. [15].
As in Palermo et al. [15], the intra-session results were far superior to the cross-session signal classification. However, our results presented higher accuracy than those of the related paper, which used only the mean amplitude value and wavelength as input features, as presented in Table 3.

Table 3. Results and comparison with related papers in the area considering the three different NINAPro databases. CD, Condition.

PAPER                     SEGMENT            DATABASE   AVERAGE ACCURACY (%)
Kuzborskij et al. [12]    400 ms + 10 ms     DB1        75.00
Zhai et al. [35]          200 ms + 100 ms    DB2        77.41
Gijsberts et al. [27]     400 ms + 10 ms     DB2        77.48
Atzori et al. [13]        200 ms             DB2        75.27
Atzori et al. [14]        400

The DB1 database gathered the same 50 distinct movements of DB2 with differences in data acquisition. The average accuracies presented in Table 3 were calculated based on the related papers' best scenario, comprising the 27 subjects of the database who performed ten repetitions of each movement. Our baseline classifiers were slightly less accurate in this database; however, the reliable form of the regularized ELM was able to match the state-of-the-art rates.
Regarding the NINAPro DB2, the results presented in Table 3 were the best case scenarios of the baseline classifiers described in the related papers composed by the average results derived from 40 subjects. Although the baseline forms of our classifiers were slightly less accurate than the accuracies of the related papers, the reliable versions of our classifiers were capable of outperforming these rates. Moreover, the length of the windows used in the segmentation process is a factor to be considered, since generally, accuracy tends to increase proportionally to the length of the sEMG signal used in the signal classification.
All the comparative tests were conducted using the same movement samples (i.e., Movement Repetitions 1, 3, 4, and 6 to train and Movement Repetitions 2 and 5 to test in DB2) and the data ratio to train and test (i.e., 50% for training and 50% for testing in DB1 and DB3) of the related papers. However, differently from the related papers, the features used were those indicated in Section 2.5 of our paper.

Discussion
Regarding the IEE database, it was perceptible that the order of execution of movements had a direct impact on the accuracy achieved. The sequential order assays (A and B) had significantly higher accuracy than those formed by random movements (C and D). In our perspective, this is caused probably by the subjects' learning capacity, implying more precise and regular movement execution concerning timestamps and the emphasis of the movement itself. In random assays, the movements can sometimes confuse the subject, generating an error factor, which generally appears magnified in Assay D, which had double the random repetitions and the lowest plateau of accuracy rates among all assays. The sequential order Assays A and B had consistent average results with a difference of ≈2% in weighted accuracy and less than 1% in the overall accuracy, which demonstrates the regularity of the classifiers in the identification of signals derived from different movement repetitions (two repetitions tested in Assay A and four repetitions tested in Assay B). These results may indicate that four representative movement repetitions were enough to train a reliable model to be tested with future n instances and still maintain the classification consistency (not considering some experimental precluding factors such as noise, electrode displacement, etc.).
A significant difference between the accuracies achieved by different subjects was also perceptible. Even using the same experimental protocol and electrode positioning, different subjects tended to present distinct bioelectric characteristics that influenced the signal classification. According to Atzori et al. [14], younger and skinnier subjects tended to achieve higher accuracy rates. Assuming the assessment of Atzori et al. [14] is true, the higher accuracy observed for Subject 2 was potentially linked to his thinner fat tissue, which may have helped in acquiring more representative sEMG signals. Thicker fat tissue usually attenuates the signal amplitude, which tends to preclude optimal signal pattern recognition. Subject 2 also presented a lower standard deviation in the weighted accuracy and generally had among the lowest discard rates of non-reliable data. The worst results were achieved by Subject 4, who had the highest Body Mass Index (BMI) among the population involved in the study. An exception occurred in Assay C, where Subject 1 reached a lower accuracy due to Repetition 2, which reached an 11% accuracy rate in the weighted accuracy metric. This value is highly unusual and should be treated as an outlier.
The number of hidden neurons is an essential parameter to define from a machine learning perspective. Since this is not the main point of our paper, we chose the number that provided the highest accuracy rate. However, there are more sophisticated approaches from information theory, generally based on the Bayesian or Akaike information criterion, as well as pruning algorithms, that aim to balance the amount of useful information used to create the model against the computational cost accepted for a given application. The regularized form of the baseline ELM classifier tended to achieve slightly higher accuracy (usually by ≈2%) due to its capacity to be more resilient to input pattern variations. For all assays and trials, the reliable forms of the classifiers boosted the classification accuracy by eliminating the non-reliable classifications. The outcome of this sample discard was observable in both accuracy metrics, but the practical effect was more visible in the weighted accuracy, where the best results reached improvements close to 20%. The largest improvement occurred for Subject 4 on Assay D, with an accuracy improvement of more than 23%. The data discard was consistent in every assay, varying between ≈7.8% and ≈14.4% (both on Assay C). The R-RELM discarded more samples in every scenario due to the more regular values of its argmax output, a consequence of the regularized method being less sensitive to outlier input samples.
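The effect of discarding non-reliable classifications can be illustrated with a minimal sketch. It is not the R-ELM or R-RELM implementation: it assumes that the reliability of a sample is simply the value of the winning (argmax) classifier output and that a fixed cut-off is applied to it. The function names, the threshold of 0.6, and the toy scores are hypothetical.

```python
import numpy as np


def reliable_predict(scores, threshold):
    """Return predicted classes and a mask of samples kept as reliable.

    scores: (n_samples, n_classes) classifier outputs.
    threshold: hypothetical cut-off on the winning output value.
    """
    scores = np.asarray(scores, dtype=float)
    predicted = scores.argmax(axis=1)      # class with the largest output
    confidence = scores.max(axis=1)        # assumed reliability score
    keep = confidence >= threshold         # True for samples kept
    return predicted, keep


def accuracy_with_discard(scores, labels, threshold):
    """Accuracy before and after discarding low-confidence samples."""
    labels = np.asarray(labels)
    predicted, keep = reliable_predict(scores, threshold)
    correct = predicted == labels
    baseline = float(correct.mean())
    reliable = float(correct[keep].mean()) if keep.any() else float("nan")
    discard_rate = float(1.0 - keep.mean())
    return baseline, reliable, discard_rate


# Toy outputs for four samples and three classes (illustrative values only).
scores = np.array([
    [0.90, 0.10, 0.00],
    [0.40, 0.50, 0.10],
    [0.10, 0.20, 0.80],
    [0.45, 0.40, 0.15],
])
labels = np.array([0, 0, 2, 1])
baseline, reliable, discard_rate = accuracy_with_discard(scores, labels, 0.6)
print(baseline, reliable, discard_rate)
```

In this toy case, the two misclassified samples are also the two with the weakest winning output, so removing them raises the accuracy on the retained samples at the cost of a higher discard rate. The same trade-off between accuracy gain and discarded data is the one discussed above.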
Regarding the two metrics used, the weighted accuracy is a more balanced way to evaluate the system, considering that sEMG databases are frequently unbalanced datasets that tend to have more samples from the rest class than from actual movements, which biases the overall accuracy rate. Given its lower average amplitude, the rest class is the most reliable "movement" to classify [13], reaching values close to 99%. However, the overall accuracy (sample-to-sample classification) still provided valuable information regarding time, enabling a more precise evaluation of where classification errors occur. Thus, it is possible to work on solutions to prevent and correct these errors, which are frequently related to the signal representativity, as presented in Figure 9, which compares the desired label (solid black line) with the predicted label (dashed red line). Despite the excellent signal prediction for the two sequential movement repetitions (Classes 9 and 10), classification ripples were present at the end of the second repetition of Movement 9 and in the middle portion of Movement 10. These ripples shifted the predicted label to Classes 10 and 11, respectively. The initial portion of the first repetition of both movements also contained a slight delay in the movement classification, and in the second repetition of Movement 10, the predicted label was advanced in time compared to the ideal label. Classification errors in the signal transitions are well known in the literature, although, upon an offline analysis, these errors could be attributed in part to the non-ideal relabeling process. The sample-to-sample plot was especially useful for identifying specific classification drawbacks such as misclassification ripples. The corresponding accuracies achieved for Movements 9 and 10 using the weighted accuracy metric were 85.82% and 90.78%, respectively.
Thus, although the weighted accuracy is a fairer comparison, the overall accuracy used along with label comparison plots can still provide valuable information.
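The bias introduced by a dominant rest class can be shown with a small numerical sketch. It assumes that the weighted accuracy is the unweighted mean of the per-class accuracies, which is one common way to balance classes and may differ in detail from the metric used for the reported results. The function names and the toy labels are hypothetical.

```python
import numpy as np


def overall_accuracy(y_true, y_pred):
    """Sample-to-sample accuracy over all samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def weighted_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [
        float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)
    ]
    return float(np.mean(per_class))


# Toy labels: 90 rest samples (class 0), all correct, and
# 10 movement samples (class 1), of which only 5 are correct.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 5 + [0] * 5)

overall = overall_accuracy(y_true, y_pred)
weighted = weighted_accuracy(y_true, y_pred)
print(overall, weighted)
```

Here the overall accuracy stays at 95% even though half of the movement samples are misclassified, whereas the class-balanced value drops to 75%, which is why both metrics are worth reporting together.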
In comparison with related work, Atzori et al. [14] and Kuzborskij et al. [12] reported average accuracy rates of 76% and 75%, respectively, in their best-case scenario using Database 1 (DB1) of NINAPro. DB1 is composed of the same 52 classes of DB2, with differences regarding the experimental protocol in the signal acquisition and the number of movement repetitions. To enable a fair comparison, we used the same movement repetitions to train and test our classifiers. However, two basic differences were the input features of the related work, which also included frequency-domain features, and the segment length used for feature extraction. Although our approach dealt only with time-domain features and used segments half as long as those of the related papers, our R-RELM method was still able to match their best results on DB1, which contains, among others, the same 17 movements performed in our database. The segment length is an important factor to consider, since longer signal portions tend to lead to higher accuracy rates at the cost of system responsiveness [24]. Instead of using a sliding window of 400 ms, we preferred segments of 200 ms, leaving room for improvement in this respect. Frequency-domain features are commonly used so that the signal representation does not rely exclusively on amplitude-based metrics, and they tend to improve the accuracy rates when combined with time-domain features. Despite testing several features in both domains, Kuzborskij et al. [12] found their best results using two time-domain features. Thus, we decided to use more straightforward features and avoid others that may overload the signal processing, which was enough to match both related papers, with a difference of ≈1% from Atzori et al. [14], who used six different features in both domains.
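The trade-off between segment length and responsiveness can be sketched as follows. This is an illustrative example only: the sampling rate of 1000 Hz, the 50 ms step, the single synthetic channel, and the two features (mean absolute value and waveform length, both common time-domain sEMG features) are assumptions chosen for the sketch and are not necessarily the features or parameters used in our experiments.

```python
import numpy as np


def sliding_windows(signal, fs, window_ms, step_ms):
    """Split a 1-D signal into overlapping windows of window_ms milliseconds."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(fs * window_ms / 1000.0))
    step = int(round(fs * step_ms / 1000.0))
    if signal.shape[0] < win:
        raise ValueError("signal is shorter than one window")
    starts = range(0, signal.shape[0] - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])


def time_domain_features(windows):
    """Mean absolute value and waveform length for each window."""
    mav = np.abs(windows).mean(axis=1)
    wl = np.abs(np.diff(windows, axis=1)).sum(axis=1)
    return np.column_stack([mav, wl])


fs = 1000                        # hypothetical sampling rate in Hz
rng = np.random.default_rng(0)
emg = rng.standard_normal(fs)    # one second of synthetic single-channel data

features_200 = time_domain_features(sliding_windows(emg, fs, 200, 50))
features_400 = time_domain_features(sliding_windows(emg, fs, 400, 50))
print(features_200.shape, features_400.shape)
```

With the same step, the 200 ms window needs half as much signal before its first feature vector is available, which is the responsiveness advantage mentioned above, while the 400 ms window averages over more samples and therefore yields smoother features.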
Regarding the papers that used the NINAPro DB2 database, as presented in Table 3, our classifiers were slightly less accurate, but comparable, in their baseline forms and outperformed the referenced papers in their reliable forms. Once more, our classifiers reached these results even while using a signal segment half as long as in the other papers and avoiding frequency-domain features, which are common to all of them. The paper of Zhai et al. [36] had the closest accuracy compared to our method. Zhai et al. used a deep learning classifier based on a Convolutional Neural Network (CNN) and an adaptive method to classify the NINAPro DB2; in their best results considering the baseline methods, they reached an average of 78.7%. As is typical of CNN-based methods, their approach relied on a considerable amount of processing to generate the models, which in their best-case scenario reached an average accuracy close to 80% using their adaptive method. The reliable versions of our classifiers achieved an average accuracy close to 80.0% for the NINAPro DB2 using our feed-forward method, without any retraining, keeping a more straightforward approach that was nevertheless as accurate as the adaptive CNN.
The paper of Palermo et al. explored the recently created NINAPro DB6, which is composed of signals acquired on five different days and focuses on upper-limb movements for gripping different objects. Based on Kuzborskij et al. [12], the authors used the mean amplitude value and the wavelength of the signal as input features to test the accuracy of the method in classifying signals from the same trial and in cross-sections of the database. In their best-case scenario, using the combination of both features, they reached an average accuracy of 52.43% for the data derived from the same trial of each user (CD1) and 25.40% when using cross-section data (CD2), mixing the data from different assays. This decrease in accuracy rate is expected since, even for the same person, the characteristics of the sEMG signal tend to change over time, changing the signal morphology and, as a consequence, the features extracted. Our method reached higher accuracies in both cases for all classifiers, using the same segmentation window length and our four time-domain features. The reliable versions enhanced the accuracy rate by ≈16% in both scenarios.
The IEE database presented consistent results with both metrics. In the best-case scenario of the R-RELM method, it achieved an average weighted accuracy of 85.41% considering all the trials and assays performed. For each independent assay, the weighted accuracy was 90.00%, 86.95%, 80.21%, and 78.14% for Assays A, B, C, and D, respectively. For the overall accuracy metric, the same tests reached average accuracies of 99.05%, 99.21%, 98.68%, and 98.89% for the same scenarios. The higher accuracy rate in favor of the sequential and smaller assays tended to indicate the influence of the subjects on the classification process. In the sequential order assays, subjects tended to execute the movements more consistently, while random assays led subjects to execute the movements in a more improvised way. Those small differences in the movement execution tended to affect the sEMG signal morphology and consequently its classification. Considering the possible variations in the movement execution (and the related muscular activation) and the number of repetitions, although the initial 66% of the data could be used successfully to train the classifiers for adjacent samples, further studies must still address more detailed problems derived from prolonged usage. Overall, the IEE database presented accuracy rates comparable to or higher than those obtained with the NINAPro database, despite the fact that it only contained movements from Exercise 2 of the NINAPro database.

Conclusions
We evaluated the IEE sEMG database, which we are making available for download on our website (www.ufrgs.br/ieelab). Additionally, we presented the reliable versions of two ELM-based classifiers, as well as the effects of data discard and the accuracy rates reached by the regularized and the standard versions of each classifier. The reliable classification appeared to mitigate the effects of even the earlier stages of processing, such as the segmentation times for feature extraction. We also presented a stochastic and practical filtering method to obtain smoother features from the signal, improving its representativity and achieving reasonable accuracy rates, which may help in the future with using the tested algorithms in a real-time prosthetic application. All the results were evaluated by two metrics: the overall accuracy and the weighted accuracy. We also evaluated our pre-processing strategies and classifiers using three different databases from NINAPro.
The accuracy results achieved provided the experimental validation of the gathered data and of the reliable forms of the ELM classifiers. We hope the IEE database can help the scientific community by providing a benchmark database, as well as all the related supplementary materials such as codes/routines, videos, and procedures, which will hopefully support the development of natural prosthetic control methods and the general development of this research field. We strongly encourage all users of our database to use both evaluation metrics, since the rest class could bias the overall accuracy result. However, the use of label comparison plots with the overall accuracy is also useful to check where classification errors occur and to propose specific solutions. The reliable forms of our ELM classifiers (especially in their regularized form) were shown to match or even outperform some state-of-the-art methods using a very straightforward approach. Since we were able to identify the reliable samples for the classification autonomously, further developments must focus on regenerative classifiers. In this regenerative classification, the classifier must be capable of identifying a class with poor fitting and autonomously requesting an updated sample to refresh the classification model. We believe that such a classifier should be capable of maintaining stable accuracies for long-term classification and should also help to mitigate the limited accuracy observed in cross-session/multiuser scenarios.
Author Contributions: V.H.C. performed the signal acquisition, all the signal pre-processing and classification after the relabeling, developed the design of experiments, developed the reliable forms of classifiers, and wrote and revised the paper. M.T. performed the signal acquisition of the IEE database, developed the LabVIEW interface, relabeled the signals, and wrote and revised the paper. J.M. performed the signal acquisition and processing, helped to evaluate the system, and wrote and revised the paper. A.B. coordinated the project, providing ideas for the signal acquisition, processing, and classification, and also wrote and revised the paper.
Funding: This research received no external funding.

Acknowledgments:
The authors would like to acknowledge the Brazilian Coordination for Improvement of Higher Level Personnel (CAPES) for the provision of the scholarships that made this work possible. The authors also want to thank professor Leia Bagesteiro for the sEMG device used in this paper, Karina Moura for the videos used in data acquisition, and all the volunteers who participated.

Conflicts of Interest:
The authors declare no conflicts of interest.