Real-Time Hand Gesture Recognition Using Surface Electromyography and Machine Learning: A Systematic Literature Review

Today, daily life is composed of many computing systems; therefore, interacting with them in a natural way makes the communication process more comfortable. Human–Computer Interaction (HCI) has been developed to overcome the communication barriers between humans and computers. One form of HCI is Hand Gesture Recognition (HGR), which predicts the class and the instant of execution of a given movement of the hand. One possible input for these models is surface electromyography (EMG), which records the electrical activity of skeletal muscles. EMG signals contain information about the intention of movement generated by the human brain. This systematic literature review analyzes the state-of-the-art of real-time hand gesture recognition models using EMG data and machine learning. We selected and assessed 65 primary studies following the Kitchenham methodology. Based on a common structure of machine learning-based systems, we analyzed the structure of the proposed models and standardized concepts regarding the types of models, data acquisition, segmentation, preprocessing, feature extraction, classification, postprocessing, real-time processing, types of gestures, and evaluation metrics. Finally, we also identified trends and gaps that could open new directions of work for future research in the area of gesture recognition using EMG.


Introduction
The increase in computing power has brought many computing devices into the daily life of human beings. A broad spectrum of applications and interfaces has been developed so that humans can interact with them. Interaction with these systems is easier when it is performed in a natural way (i.e., just as humans interact with each other using voice or gestures). Hand Gesture Recognition (HGR) is a significant element of Human-Computer Interaction (HCI), which studies computer technology designed to interpret commands given by humans.
HGR models are human-computer systems that determine what gesture was performed and when a person performed the gesture. Currently, these systems are used in several applications, such as intelligent prostheses [1][2][3], sign language recognition [4,5], rehabilitation devices [6,7], and device control [8].
HGR models acquire data using, for example, gloves [9], vision sensors [10], inertial measurement units (IMUs) [11], surface electromyography sensors, and combinations of sensors, such as surface electromyography sensors and IMUs [12]. Although there are different options for data acquisition, all of these options have their limitations; for example, gloves and vision sensors cannot be used by amputees; gloves can constrain normal movement, especially in cases involving the manipulation of objects; vision sensors can have occlusion problems, and changes of illumination and changes in the distance between the hands and the sensors; and IMUs and surface electromyography sensors generate noisy data [13,14]. Even though all these devices collect data related to the execution of a hand movement, surface electromyography sensors also extract the intention of the movement. This means that these sensors can also be used with amputees, who cannot execute the movements, but have the intention to do so [15,16].
Surface electromyography, which we will refer to from now on as EMG, is a technique that records the electrical activity of skeletal muscles with surface sensors. This electrical activity is produced from two states of a skeletal muscle. The first state is when a skeletal muscle is at rest, where each of the muscular cells (i.e., muscle fibers) has an electric potential of approximately -80 mV [15]. The second state is when a skeletal muscle is contracted to produce the electric potential that occurs in a motor unit (MU), which is composed of muscle fibers and a motor neuron. These electric potential differences are produced when a motor neuron activates a neuromuscular junction by sending two intracellular action potentials in opposite directions. Then, they are propagated by depolarizing and re-polarizing each one of the muscle fibers [16]. The sum of the intracellular action potentials of all muscle fibers of a motor unit is called a motor unit action potential (MUAP). Therefore, when a skeletal muscle is contracted, the EMG is a linear summation between several trains of MUAPs [15].
There are two types of muscle contractions: static and dynamic. In a static contraction, the lengths of the muscle fibers do not change, and the joints are not in motion, but the muscle fibers still contract, for example, when someone holds their hand still to make the peace sign. In a dynamic contraction, the lengths of the muscle fibers change, and the joints are in motion, for example, when someone waves their hand to make the hello gesture [17].
The EMG signals can be modeled as a stochastic process that depends on the two types of contraction described above. First, the mathematical model for a static contraction (MMSC) is a stationary process because the mean and covariance remain approximately the same over time, and the EMG depends solely on muscle force [18]. Consider (1):

$$\mathrm{EMG}(t) = \sum_{i=1}^{N} s_i(t) * m_i(t) \tag{1}$$

where $N$ is the number of active MUs, $s_i(t)$ is the train of impulses that indicates the active moments of each MU, $m_i(t)$ is the MUAP of each MU, and $*$ denotes convolution. However, the MMSC can be viewed as a non-stationary process when factors such as muscular fatigue and temperature affect the EMG [19]. Second, the mathematical model for a dynamic contraction (MMDC) is a non-stationary process, and its mathematical model is similar to amplitude modulation (AM):

$$\mathrm{EMG}(t) = a(t)\,w(t) + n(t) \tag{2}$$

where $a(t)$ is a function that indicates the intensity of the EMG signal (i.e., information signal), $w(t)$ is a unit-variance Gaussian process representing the stochastic aspect of the EMG (i.e., carrier signal), and $n(t)$ is the noise from the sensors and biological signal artifacts [17,20].
The mathematical models of EMG are not used in HGR due to the difficulty of parameter estimation in non-stationary processes. However, machine learning (ML) methods are widely used because ML can infer a solution for non-stationary processes [21] using several techniques, for example, covariate shift techniques [21,22], class-balance change [22], and segmentation in short stationary intervals [23]. HGR using ML is just one approach to myoelectric control [24], which uses EMG signals to extract control signals to command external devices [25,26], for example, prostheses [1], drones [8], and input devices for a computer [27]. Other approaches include conventional amplitude-based control and the direct extraction of neural code from EMG signals.
In conventional amplitude-based control, one EMG channel controls one function of a device (e.g., hand open is assigned to one channel, and hand closed to a second channel). When the amplitude of this EMG exceeds a predefined threshold, this function is activated [28][29][30][31]. The direct extraction of neural code from EMGs is another approach, in which the motor neuron spike trains are decoded from EMG signals to translate into commands [32][33][34][35].
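The threshold rule of conventional amplitude-based control can be illustrated with a minimal sketch. The amplitude estimate (mean absolute value of a window), the channel-to-function mapping, and the threshold values are assumptions chosen for illustration; they are not taken from the cited studies.

```python
def mean_absolute_value(window):
    """Amplitude estimate of one EMG window (a list of samples)."""
    return sum(abs(x) for x in window) / len(window)


def amplitude_control(windows_by_channel, thresholds, functions):
    """Return the device functions whose channel amplitude exceeds its threshold."""
    active = []
    for channel, window in windows_by_channel.items():
        if mean_absolute_value(window) > thresholds[channel]:
            active.append(functions[channel])
    return active


# Hypothetical two-channel setup: channel 0 -> hand open, channel 1 -> hand closed.
functions = {0: "hand_open", 1: "hand_closed"}
thresholds = {0: 0.2, 1: 0.2}
windows = {
    0: [0.5, -0.6, 0.4, -0.5],      # strong contraction on channel 0
    1: [0.05, -0.04, 0.03, -0.02],  # channel 1 close to rest
}
print(amplitude_control(windows, thresholds, functions))
```

In this sketch each channel is tied to exactly one function, which is why this approach scales poorly to many gestures compared with pattern-recognition-based HGR.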
For many applications, HGR models are required to work in real time. A human-computer system works in real time when a user performs an action over the system, and this system gives him/her a response fast enough that it is perceived as instantaneous [25]. Moreover, the response time in a real-time human-computer system is relative to its application and user perception [36]. For this reason, the controller delay, which is the response time of an HGR model, has been widely researched. For instance, a user does not perceive any delay when the controller delay is less than 100 ms in the control of devices, such as a key or a switch [36,37]. In HGR using EMGs, Hudgins & Parker et al. [38] stated that the acceptable computational complexity is limited by the controller delay of the system, which must be kept below 300 ms to reduce the user-perceived lag. This optimal controller delay was generally agreed upon by many researchers [39,40]. However, there have been several optimal controller delays reported in the scientific literature, namely 500 ms [41], and 100-125 ms [42] using a box and blocks test, which is a target achievement test.
Most of the real-time HGR models are evaluated using metrics for machine learning, such as accuracy, recall, precision, F-score, R² error, etc. However, this evaluation fails to reflect the performance exhibited in online scenarios as it does not account for the adaptation of users to non-stationary signal features [43][44][45][46][47]. For example, Hargrove et al. [48] demonstrated that the inclusion of transient contractions (i.e., non-stationary signals) in the training data decreases the accuracy, but improves the user performance in a real-time virtual clothespin task. Therefore, in order to evaluate the real-life performance, the real-time HGR models can be evaluated using target achievement tests, such as the box and blocks test [42,49], the target achievement control test [50], and the Fitts' law test [51], which is an international standard in HCI (ISO 9241-9).
Currently, there are many primary studies regarding real-time HGR models using EMG and ML, which, in several cases, do not have standardized concepts, such as types of models, real-time processing, types of hand gestures, and evaluation metrics. This standardized knowledge is essential for reproducibility and requires a Systematic Literature Review (SLR) of the current primary studies. To the best of our knowledge, there is no SLR regarding these HGR models. Therefore, we developed this SLR to present the state-of-the-art of the real-time HGR models using EMG and ML. Based on this SLR, we make three contributions to the field of HCI. First, we define a standard structure of real-time HGR models. Second, we standardize concepts, such as the types of models, data acquisition, segmentation, preprocessing, feature extraction, classification, postprocessing, real-time processing, types of gestures recognized, and evaluation metrics. Finally, we discuss future work based on the research gaps we identified.
Following this introduction, the article is organized as follows: in Section 2 we describe the methodology used to execute this SLR; in Section 3 we outline the results and the discussion of the data extracted from the primary studies; and Sections 4 and 5 contain the conclusions and future work respectively.

Methodology
We developed an SLR based on the methodology proposed in [52,53], which is comprised of five stages: Research Questions (RQs), Search of Primary Studies, Analysis of Primary Studies, Data Extraction, and Threats to Validity.

Research Questions
In this stage, we define the following four research questions according to the research goal, which is to investigate the state-of-the-art of real-time HGR models that use EMG and ML:

Search of Primary Studies
In this stage, we search for the primary studies that can answer the four RQs stated in the previous section. This stage has three parts, which were done manually. In the first part, we selected the literature repositories. In the second part, we extracted the keywords of the RQs, and we developed the search strings using these keywords. Finally, we searched the primary studies in the literature repositories using the search strings.
We used four literature repositories: IEEE Xplore, ACM Digital Library, Science Direct, and Springer. We chose these repositories as they have the most primary studies on real-time HGR models that use EMG and ML and also because these repositories have peer-reviewed papers.
The extracted keywords from the RQs (see Section 2.1) are electromyography, hand gesture recognition, real-time, box and blocks, target achievement control, and Fitts' law. We then added the acronym of electromyography (i.e., "EMG") and the real-time variations: online, real time, on line, and on-line. Therefore, the 11 keywords used in this SLR are electromyography, EMG, hand gesture recognition, real time, real-time, online, on line, on-line, box and blocks, target achievement control, and Fitts' law. Table 1 shows the 16 Search Strings (SS), which were developed with the combination of these 11 keywords and the Boolean operator "AND". We do not use the keyword myoelectric control because this SLR is focused on HGR using EMG and ML, which is just one segment of the approaches to myoelectric control (see Section 1).

Table 1. Search strings used to find primary studies.
SS1   "Electromyography" AND "Hand Gesture Recognition" AND "Real Time"
SS2   "Electromyography" AND "Hand Gesture Recognition" AND "Real-Time"
SS3   "Electromyography" AND "Hand Gesture Recognition" AND "Online"
SS4   "Electromyography" AND "Hand Gesture Recognition" AND "On line"
SS5   "Electromyography" AND "Hand Gesture Recognition" AND "On-line"
SS6   "Electromyography" AND "Hand Gesture Recognition" AND "box and blocks"
SS7   "Electromyography" AND "Hand Gesture Recognition" AND "target achievement control"
SS8   "Electromyography" AND "Hand Gesture Recognition" AND "Fitts' law"
SS9   "EMG" AND "Hand Gesture Recognition" AND "Real Time"
SS10  "EMG" AND "Hand Gesture Recognition" AND "Real-Time"
SS11  "EMG" AND "Hand Gesture Recognition" AND "Online"
SS12  "EMG" AND "Hand Gesture Recognition" AND "On line"
SS13  "EMG" AND "Hand Gesture Recognition" AND "On-line"
SS14  "EMG" AND "Hand Gesture Recognition" AND "box and blocks"
SS15  "EMG" AND "Hand Gesture Recognition" AND "target achievement control"
SS16  "EMG" AND "Hand Gesture Recognition" AND "Fitts' law"

We looked for the primary studies published from 1 January 2013 to 31 December 2019 (i.e., the last day of search in the literature repositories) using the 16 search strings shown in Table 1. Table 2 shows the 1485 primary studies that were found in the four literature repositories: IEEE Xplore: 397, ACM Digital Library: 400, Science Direct: 329, and Springer: 359.
We discarded 1021 duplicated primary studies of the 1485 primary studies (IEEE Xplore: 206, ACM Digital Library: 273, Science Direct: 276, and Springer: 266). Additionally, we added 23 primary studies to this SLR using the snowballing techniques, which identify the articles that have cited the primary studies found in the literature repositories (i.e., forward snowballing), and the articles from their references (i.e., backward snowballing) [54] (see Table 2). Therefore, we obtained 487 primary studies in total. Figure 1 shows the resulting primary studies after each action carried out in the two stages: the search of primary studies and the analysis of primary studies.
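The construction of the 16 search strings and the study counts reported above can be checked with a short sketch. The variable names are illustrative; the keywords and the numbers are the ones stated in the text.

```python
from itertools import product

# Keyword groups taken from the text; "Hand Gesture Recognition" is fixed.
emg_terms = ["Electromyography", "EMG"]
context_terms = [
    "Real Time", "Real-Time", "Online", "On line", "On-line",
    "box and blocks", "target achievement control", "Fitts' law",
]

# 2 EMG terms x 8 context terms = 16 search strings joined with AND.
search_strings = [
    f'"{emg}" AND "Hand Gesture Recognition" AND "{ctx}"'
    for emg, ctx in product(emg_terms, context_terms)
]

# Counts per repository as reported in the text.
found = {"IEEE Xplore": 397, "ACM Digital Library": 400,
         "Science Direct": 329, "Springer": 359}
duplicates = {"IEEE Xplore": 206, "ACM Digital Library": 273,
              "Science Direct": 276, "Springer": 266}
snowballed = 23

total_found = sum(found.values())
total_duplicates = sum(duplicates.values())
remaining = total_found - total_duplicates + snowballed

print(len(search_strings), total_found, total_duplicates, remaining)
```

The totals reproduce the figures in the text: 1485 studies found, 1021 duplicates discarded, and 487 studies after snowballing.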

Analysis of Primary Studies
We filtered the 487 primary studies based on the analysis of the titles, abstracts, and conclusions using the inclusion and exclusion criteria, and the assessment questions (see Figure 1). We finally selected 65 primary studies (see Table 3), which were used to answer the four RQs (see Section 2.1).

Quality Assessment
We defined three assessment questions to evaluate the comprehensiveness, reliability, and applicability of the primary studies. For each question, we established three possible answers with their scores: "Yes" = 1, "Partly" = 0.5, and "No" = 0. Thus, a primary study was rejected if the sum of the three scores was less than 2. The three assessment questions are:

• Were the research objectives of the primary studies clear?
• Was the contribution of the primary study clear?
• Was the structure of the HGR model shown?

Data Extraction
We extracted the data shown in Table 5 from the 65 selected primary studies (SPS), shown in Table 3. This extraction was performed in order to answer the four RQs (see Section 2.1).

Threats to Validity
We discuss the following possible threats to the validity of this SLR and the mitigation of these threats: an incomplete selection of the SPS, a biased analysis of the primary studies, and inaccurate data extraction.

Incomplete Selection of the SPS
There is a possibility that relevant studies have been omitted for two reasons. The literature repositories may not have had all relevant studies for the four RQs, and the search strings may not have been appropriate for the four RQs. However, the authors performed the following three actions to mitigate these two threats: (1) We developed this SLR based on the Kitchenham methodology [52,53], which was shown in Section 2. (2) In this SLR, the four literature repositories and the 16 search strings were proposed by the first author, and the second and third authors assessed the relevance of these literature repositories and search strings. The four literature repositories were assessed in accordance with the criterion that these repositories are the most used in the ML area. The 16 search strings were assessed based on the criterion that the keywords and the structures of the search strings are relevant to the four RQs. (3) We applied the snowballing techniques [54] to add 14 SPS to the SLR. This task was performed by the first author, and the third author assessed the relevance of these 14 SPS.

Biased Analysis of Primary Studies
The analysis of the primary studies (see Section 2.3) can be biased for two reasons. The inclusion and exclusion criteria may not be relevant to the four RQs, and the SPS may not be comprehensive, reliable, and applicable. To mitigate these two threats, the authors performed the following two actions: (1) The authors developed formal inclusion and exclusion criteria (see Section 2.3.1) and quality assessment criteria (see Section 2.3.2). These criteria were proposed by the first author, and they were assessed by the second and third authors. (2) The first author selected 65 primary studies by reading the title, abstract, and conclusions. The first author also read the whole study when the title, abstract, and conclusions were not clear. Furthermore, these 65 SPS were assessed by the second and third authors.

Inaccurate Data Extraction
Generally, the extracted data can be inaccurate because of two possible problems: unsystematic data extraction and data that are not relevant to the RQs. To solve these problems, we extracted the data using a systematic methodology based on the four RQs (see Section 2.4). Moreover, the authors made sure that the extracted data answer the four RQs.

Results and Discussion
The data extracted from the 65 SPS (see Table 3) are presented and analyzed in five subsections: the study overview subsection and the other four subsections, one per each RQ (see Section 2.1). Although some SPS presented more than one HGR model, we selected the models with the best performance in the evaluation; therefore, we used 65 HGR models for this review.

Study Overview
The study overview gives a general view of the settings used in the SPS. Among other data, we extracted the publication year and the type of publication. Figure 2a shows the number of SPS per year, which has increased steadily since 2013. Moreover, Figure 2b shows that most of the SPS were presented at conferences (see also Table 3).

Results of the RQ1 (What Is the Structure of Real-Time HGR Models Using EMG and ML?)
We found that the structures of the 65 real-time HGR models are not regular across the studies. However, they have some stages in common, such as Data Acquisition (DA), Segmentation (SEGM), Preprocessing (PREP), Feature Extraction (FE), Classification (CL), and Postprocessing (POSTP). We assembled the most frequent stages into a standard structure, which is illustrated in Figure 3. Note that there are SPS that did not use all stages of the standard structure because Segmentation, Preprocessing, Feature Extraction, and Postprocessing are optional stages (i.e., without them a model is still feasible). Table 6 shows the stages of the standard structure used by the SPS.
Aside from the structure of the models, we identified two types of models: individual models and general models. Individual models are trained on the gestures (data) of one person and recognize the gestures of that same person. General models are trained with the data of several people and recognize the gestures of any person. We found 44 individual models [...]. SPS 35 is the only general model that was evaluated using EMG data from people who did not participate in the training phase. The other 10 general models only used EMG data from people who participated in the training; therefore, it is not possible to conclude that these 10 models are able to recognize gestures of any person.

Data Acquisition
In the Data Acquisition stage, EMGs are acquired from EMG sensors, which can be part of homemade or commercial devices. Table 7 shows the number of sensors, the sampling rates, and the acquisition devices used in the HGR models. We found that 27 SPS [...] SPS 49, and SPS 55) is 1000 Hz because these SPS indicate that the sampling rate must be at least twice the highest frequency of the EMG, according to the Nyquist sampling theorem, and approximately 95% of the signal power in the EMG is below 400-500 Hz [114][115][116]. Table 7 also shows the use of commercial devices, including the Myo armband from Thalmic Labs Inc., the MA300 from Motion Lab Systems Inc., the Bio Radio 150 from Cleveland Medical Devices Inc., the ME6000 from Mega Electronics Ltd., the Analog Front End (ADS1298) from Texas Instruments, the Telemyo 2400T G2 from Noraxon, and the EMG-USB2 from OT Bioelettronica. Furthermore, two models (SPS 43 and SPS 45) use high-density EMG sensors.

Segmentation
EMGs are partitioned into multiple segments or windows using different techniques, such as gesture detection and sliding windowing (see Table 7). Gesture detection computes the beginning and the end of a hand gesture, and returns only the EMG that corresponds to muscle contraction. Therefore, the segment lengths are variable as they depend on the duration of the hand gestures. The sliding windowing techniques partition the EMG into fixed adjacent segments (i.e., adjacent sliding windowing) or fixed overlapping segments (i.e., overlapping sliding windowing) (see Figure 4). Increasing the window length, up to a certain point, increases both the controller delay and the accuracy of the models, as more data are collected for recognition [25,40].
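The two sliding windowing variants differ only in the step between consecutive windows, as the following minimal sketch shows. The function name and the `stride` parameter are illustrative choices, not terminology from the reviewed studies.

```python
def sliding_windows(signal, window_length, stride):
    """Split `signal` into fixed-length windows.

    stride == window_length yields adjacent windows;
    stride <  window_length yields overlapping windows.
    Trailing samples that do not fill a whole window are dropped.
    """
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    last_start = len(signal) - window_length
    return [signal[start:start + window_length]
            for start in range(0, last_start + 1, stride)]


samples = list(range(10))  # stand-in for one EMG channel
adjacent = sliding_windows(samples, window_length=4, stride=4)
overlapping = sliding_windows(samples, window_length=4, stride=2)
print(adjacent)
print(overlapping)
```

With overlapping windows, a new decision can be produced every `stride` samples even though each decision still uses `window_length` samples of data.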

Preprocessing
HGR models use preprocessing techniques that transform the EMG into an input signal for Feature Extraction or for the ML algorithm if the structure of the HGR model does not have Feature Extraction (see Table 6). For example, a common preprocessing technique is the use of a Notch Filter at 50 or 60 Hz that eliminates the AC frequency of the powerlines (SPS 10). Other examples include Offset Compensation, Pre-smoothing, Filtering, Rectification, Amplification, and the use of the Teager-Kaiser-Energy Operator (see Table 7).
Offset Compensation is a technique that eliminates noise through the compensation of the average value of the EMG:

$$\mathrm{EMG}_{raw} = (x_1, x_2, \ldots, x_n), \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mathrm{EMG}_{offset} = (x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x})$$

where $x_1, x_2, \ldots, x_n$ are the raw EMG values, $\bar{x}$ is the average value of the signal, and $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$ are the EMG values after the use of offset compensation.
Pre-smoothing is a technique that computes the mean of the last $m$ values of the EMG and then sets the mean to the current value $x_n$ of the signal:

$$\mathrm{EMG}_{raw} = (x_1, x_2, \ldots, x_n), \qquad x_n = \frac{1}{m}\sum_{i=n-m}^{n-1} x_i$$

where $x_1, x_2, \ldots, x_n$ are the raw EMG values and $x_n$ is the current value that is based on the mean of the $m$ previous values of the raw EMG.
Filtering is a technique that removes some unwanted frequencies or an unwanted frequency band from the raw EMG. Rectification transforms the negative values into positive values (e.g., absolute value function). The Teager-Kaiser-Energy Operator increases the signal-to-noise ratio to improve the muscle activity onset detection of a gesture [117]. The most used preprocessing technique is filtering (see Table 7).
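Several of these preprocessing techniques reduce to a few lines of arithmetic, as the sketch below suggests. Two points are assumptions rather than statements from the source: pre-smoothing is taken over the m samples preceding the current one (the first m samples are left unchanged), and the Teager-Kaiser-Energy Operator uses its usual discrete form, x[n]^2 - x[n-1]*x[n+1], which the text does not spell out.

```python
def offset_compensation(x):
    """Subtract the average value of the signal from every sample."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def rectify(x):
    """Full-wave rectification: map negative values to positive ones."""
    return [abs(v) for v in x]


def pre_smooth(x, m):
    """Replace each sample by the mean of the m previous raw samples.

    Assumption: the first m samples have too few predecessors and are kept.
    """
    out = list(x[:m])
    for n in range(m, len(x)):
        out.append(sum(x[n - m:n]) / m)
    return out


def teager_kaiser(x):
    """Discrete Teager-Kaiser energy for the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


raw = [1.0, 3.0, 2.0, 6.0]
print(offset_compensation(raw))
print(rectify(offset_compensation(raw)))
print(pre_smooth(raw, m=2))
print(teager_kaiser([1.0, 2.0, 3.0, 4.0]))
```

Filtering (e.g., a notch filter at 50 or 60 Hz) is omitted here because it needs a filter design step that is usually delegated to a signal-processing library.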

Feature Extraction
Feature extraction techniques map the EMG into a feature set. These techniques extract features in different domains, such as time, frequency, time-frequency, space, and fractal. Table 8 shows the domains of the feature extraction techniques used by the models. Most of the real-time HGR models use time-domain features because the controller delay for their computation is less than the controller delay for the computation of features in other domains (see Table 9). The mean absolute value is the most used feature in the 65 studies analyzed.
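To illustrate why time-domain features are cheap to compute, the sketch below implements a few of them (mean absolute value, root mean square, waveform length, zero crossings, and slope sign changes) with one pass over a window each. The formulas follow common textbook definitions; the optional `threshold` parameters are an assumption added for noise rejection and are not specified in the reviewed studies.

```python
import math


def mav(w):
    """Mean absolute value of a window."""
    return sum(abs(x) for x in w) / len(w)


def rms(w):
    """Root mean square of a window."""
    return math.sqrt(sum(x * x for x in w) / len(w))


def waveform_length(w):
    """Cumulative length of the waveform over the window."""
    return sum(abs(b - a) for a, b in zip(w, w[1:]))


def zero_crossings(w, threshold=0.0):
    """Sign changes between consecutive samples whose difference exceeds threshold."""
    return sum(1 for a, b in zip(w, w[1:])
               if a * b < 0 and abs(a - b) > threshold)


def slope_sign_changes(w, threshold=0.0):
    """Number of local extrema (changes in the sign of the slope)."""
    return sum(1 for a, b, c in zip(w, w[1:], w[2:])
               if (b - a) * (b - c) > threshold)


window = [1.0, -2.0, 3.0, -4.0]
features = [mav(window), rms(window), waveform_length(window),
            zero_crossings(window), slope_sign_changes(window)]
print(features)
```

Each feature needs only additions, multiplications, and comparisons over the window, with no transform to another domain, which is consistent with the lower computation delay noted above.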

Time-Domain Features
Mean absolute value (MAV), root mean square (RMS), waveform length (WL), zero crossings (ZC), fourth-order autoregressive coefficients (AR-Coeff), standard deviation (SD), variance (VAR), slope sign changes (SSC), mean, median, integrated EMG (iEMG), sample entropy (SampEn), mean absolute value ratio

Many works analyze some of the stages shown in Section 3.2 to determine the best structure for improving the accuracy of the HGR models, for example, the data acquisition [39,48,118,119], optimal window length [120], filtering [121,122], feature extraction [123], and classification [124,125] stages. However, the results are inconclusive because the structure of the HGR models depends on the environment in which the models are developed (i.e., the data sets used, the people who participated in the evaluation, the application of the models, etc.).

Controller Delay of the HGR Models
The controller delay is the sum of two values, which are the data collection time (DCT) (i.e., window length) and the data analysis time (DAT) [39,42]. In real-time processing, the DCT and DAT should be as short as possible, but the DCT also should allow the HGR model to collect enough EMG data to recognize a hand gesture. For instance, in prosthesis control, the optimal DCT using four EMG sensors with a sampling rate of 1 kHz should be between 150-250 ms [120].
An HGR model using EMG is considered to work in real-time when the response time (i.e., controller delay) is less than the optimal controller delay. There are several optimal controller delays reported in the scientific literature, namely 300 ms [39], 500 ms [41], and 100 ms for fast prosthetic prehensors and 125 ms for slower prosthetic prehensors [42].
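The real-time criterion described above amounts to a single comparison, sketched below. The 200 ms window and 50 ms analysis time are hypothetical values chosen for illustration; the thresholds are the optimal controller delays quoted in the text.

```python
def controller_delay_ms(dct_ms, dat_ms):
    """Controller delay = data collection time + data analysis time."""
    return dct_ms + dat_ms


def is_real_time(dct_ms, dat_ms, optimal_delay_ms=300.0):
    """True when the controller delay is below the chosen optimal delay."""
    return controller_delay_ms(dct_ms, dat_ms) < optimal_delay_ms


def window_samples(dct_ms, sampling_rate_hz):
    """Number of samples per channel collected during the DCT."""
    return round(dct_ms * sampling_rate_hz / 1000)


# Hypothetical model: 200 ms window, 50 ms of analysis, 1 kHz sampling rate.
dct, dat = 200.0, 50.0
print(controller_delay_ms(dct, dat))                  # total delay in ms
print(is_real_time(dct, dat, optimal_delay_ms=300))   # against 300 ms
print(is_real_time(dct, dat, optimal_delay_ms=125))   # against 125 ms
print(window_samples(dct, 1000))                      # samples per channel
```

The same model can therefore pass under one reported threshold and fail under another, which is why the chosen optimal controller delay should be stated alongside the DCT and DAT.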
In accordance with the Inclusion and Exclusion Criteria (see Section 2.3.1), all 65 HGR models indicate that they are real-time models. However, there are some SPS that did not report the controller delay (i.e., DCT and DAT) of their HGR models. Table 10 shows the DCT and DAT of the SPS.

Hardware Used
The controller delay of the HGR models not only depends on their structure but also on the hardware used to process the models. For example, an HGR model may not work in real-time if the user perceives delays in the HGR response because the device has limited processing capabilities. The same HGR model may also be considered to work in real-time in another device with better processing capabilities. For this reason, when a model is described, it is fundamental to indicate the hardware characteristics of the devices used for running an HGR model. Table 10 shows the two types of hardware used, which are personal computers and embedded systems. Ten HGR models were processed in personal computers, such as laptops, desktops, etc., five HGR models were processed in embedded systems, and the remaining models did not indicate the hardware used.

Number of Gestures Recognized
The number of gestures recognized is the number of classes of an HGR model. There are HGR models that recognize the same number of gestures but differ in the gestures themselves. For example, there are two HGR models that recognize four gestures, but the classes of the first model are thumb up, okay, wrist valgus, and wrist varus (SPS 14), and the classes of the second model are hand extension, hand grasp, wrist extension, and thumb flexion (SPS 22). Hence, to compare these models, it is important to consider the difference in the gestures as well.

Type of Gestures Recognized
The hand gestures, according to the type of movement, are classified as static and dynamic. A static gesture is made when the skeletal muscles are in constant contraction (i.e., there is no movement during the gesture), and in a dynamic gesture, the skeletal muscles are in contraction, but it is not constant, which indicates that there is movement during the gesture.
The EMG data generated by a gesture has two states: transient and steady. The EMG data in the transient state are generated during the transition from one gesture to another, and the EMG data in the steady state are generated when a gesture is maintained [38]. Moreover, the offline classification of hand gestures using EMG data in the steady state is more accurate than in the transient state as the variance of the EMG data in the transient state varies more (i.e., non-stationary process) than in the steady state over time [40]. However, in the training phase, the inclusion of EMG data in the transient state improves subject performance in a real-time virtual clothespin task [46,48]. Figure 5 presents the EMG data of a person who made a long-term gesture (i.e., gestures that lasted a long time) after a relaxed position or rest gesture. In this figure, the EMG data in the transient state are generated during the transition from the rest gesture to the peace sign, and the EMG data in the steady state are generated when the peace sign is maintained. The short-term gestures (i.e., gestures that lasted only a short time) generate more EMG data in the transient state than in the steady state as most of the time is spent in transitions from one gesture to another (see Figure 6).
The durations of the gestures used by the models are shown in Table 11. This table shows seven aspects about the gestures recognized by the HGR models reviewed in this SLR, such as the number of classes, the number of gestures per person in the training set (NGpPT), the number of people who participated in the training (NPT), the number of gestures per person in the evaluation set (NGpPE), the type of gestures recognized, the state of the EMG data used, and the duration of the gestures (DG). NGpPT, NPT, and DG show the EMG data used to train the individual (NGpPT × DG), and general (NGpPT × NPT × DG) models. We found that 63 out of 65 HGR models recognized static gestures, and only one HGR model recognized both dynamic and static gestures (SPS 25); moreover, no HGR model recognized only dynamic gestures. Additionally, six SPS used EMG data in the steady state, two SPS used EMG data in the transient state, three SPS used EMG data in the steady and transient states, and the remaining HGR models did not indicate the state of the EMG data. There were 31 out of the 65 HGR models that considered the rest gesture (i.e., the hand does not make any movement) as a class.

Table 11. The number of gestures recognized (i.e., classes), number of gestures per person in the training set (NGpPT), the number of people who participated in the training (NPT), the number of gestures per person in the evaluation set (NGpPE), the type of gestures recognized, the state of the EMG data used, and the duration of the gestures (DG).

Finally, 5 out of the 65 HGR models (SPS 59, SPS 60, SPS 62, SPS 63, and SPS 64) recognized static gestures simultaneously to control multiple degrees of freedom of a prosthesis, which replicates simultaneous movements, such as wrist rotation and grasp to turn a doorknob. The remaining HGR models recognized gestures sequentially.

Results of the RQ4 (What Are the Metrics Used to Evaluate Real-Time HGR Models Using EMG and ML?)
According to the type of evaluation (see Section 1), we divide the SPS into two groups: HGR models evaluated using metrics for machine learning (56 models) and HGR models evaluated using target achievement tests (nine models).

HGR Models Evaluated Using Metrics for Machine Learning (from SPS 1 to SPS 56)
These 56 HGR models used 13 evaluation metrics (see Table 12): accuracy (9), recall (10), precision (11), accuracy per user (12), recall per user (13), precision per user (14), median of the accuracy per user (15), standard deviation of the accuracy per user (16), standard deviation of the accuracy per class (17), standard deviation of each user accuracy (18), standard deviation of the recalls of each class (19), classification error (20), and Kappa index (21). Accuracy is the most used metric. The formulas of these evaluation metrics are:
$$\mathrm{Accuracy}_{\mathrm{user}(i)} = \frac{\sum_{j=1}^{g} n_{i,j,j}}{\sum_{j=1}^{g} \sum_{k=1}^{g} n_{i,j,k}}$$
$$\mathrm{Recall}_{\mathrm{user}(i)\,\mathrm{class}(k)} = \frac{n_{i,k,k}}{\sum_{j=1}^{g} n_{i,j,k}}$$
$$\mathrm{Precision}_{\mathrm{user}(i)\,\mathrm{class}(j)} = \frac{n_{i,j,j}}{\sum_{k=1}^{g} n_{i,j,k}}$$
$$\mathrm{Median}\left(\mathrm{Accuracy}_{\mathrm{user}(1)}, \mathrm{Accuracy}_{\mathrm{user}(2)}, \ldots, \mathrm{Accuracy}_{\mathrm{user}(u)}\right)$$
where $n_{i,j,k}$ is the number of gestures made by user $i$ that the model recognized as class $j$ and whose actual class was $k$ (so the numerator of the accuracy counts only the cases with $j = k$); $i \in I = \{i_1, i_2, \ldots, i_u\}$ is the set of test users, $j \in J = \{j_1, j_2, \ldots, j_g\}$ is the set of predicted classes, $k \in K = \{k_1, k_2, \ldots, k_g\}$ is the set of actual classes, $u$ is the total number of test users, and $g$ is the number of classes.
We identified five machine-learning metrics that evaluate the entire HGR model. The first one is accuracy, which is the fraction of gestures recognized correctly among all the test data. Second, the recall is the fraction of gestures recognized correctly for a class among the test data of this class. Third, the precision is the fraction of gestures recognized correctly for a class among the gestures recognized by the HGR model as this class. Fourth, the standard deviation of the accuracy per user is the amount of dispersion of the recognition accuracies per user. Finally, the standard deviation of the accuracy per class is the amount of dispersion of the recalls of a particular model.
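The per-user formulas above can be illustrated with a minimal Python sketch. The function name `user_metrics`, the two-class counts, and the use of the sample standard deviation are assumptions made for illustration only; the reviewed studies do not prescribe this code, and the sketch assumes every class appears at least once as both an actual and a predicted class.

```python
from statistics import median, stdev

def user_metrics(n):
    """Metrics for one user.

    n[j][k] is the number of gestures recognized as class j whose
    actual class is k (hypothetical layout, matching n_{i,j,k}).
    """
    g = len(n)
    total = sum(sum(row) for row in n)
    # Accuracy: gestures on the diagonal (j == k) over all test gestures.
    accuracy = sum(n[j][j] for j in range(g)) / total
    # Recall of class k: correct gestures over all gestures whose actual class is k.
    recall = [n[k][k] / sum(n[j][k] for j in range(g)) for k in range(g)]
    # Precision of class j: correct gestures over all gestures recognized as j.
    precision = [n[j][j] / sum(n[j]) for j in range(g)]
    return accuracy, recall, precision

# Made-up counts for two test users and two classes.
counts = {
    "user_1": [[8, 1], [2, 9]],
    "user_2": [[5, 0], [5, 10]],
}

accuracies = [user_metrics(n)[0] for n in counts.values()]
median_accuracy = median(accuracies)
# Sample standard deviation is an assumption; the source does not say which form is used.
std_accuracy = stdev(accuracies)
```

With these made-up counts, the two users reach accuracies of 0.85 and 0.75, so reporting only the overall accuracy would hide the difference between them.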
These metrics can produce biased results for two reasons: an incorrect definition of a true positive, and an unbalanced test set. To determine the recognition accuracy, a gesture is considered a true positive (i.e., the gesture is recognized correctly) only when the HGR model determines both which gesture was performed and when the person performed it. However, only SPS 51 is evaluated in this way. Eleven HGR models (SPS 2, SPS 5, SPS 6, SPS 7, SPS 8, SPS 9, SPS 19, SPS 20, SPS 34, SPS 35, and SPS 36) determine the classification accuracy instead, because they only took into consideration which gesture was performed by a person as a true positive, and the remaining models do not state what they consider a true positive.
In addition, the test set is balanced when it has the same number of samples per class and the same number of samples per user (see Table 13). For example, if an HGR model is evaluated using a set that has far more data for user A than for the other users, the overall accuracy of this model tends toward the accuracy of user A.
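The effect of an unbalanced test set can be sketched with a few lines of Python. The users, counts, and function names below are hypothetical and only illustrate the example given above; they are not taken from any reviewed study.

```python
def pooled_accuracy(per_user):
    """Accuracy over all test gestures pooled together.

    per_user maps a user to (correct gestures, total gestures).
    """
    correct = sum(c for c, _ in per_user.values())
    total = sum(t for _, t in per_user.values())
    return correct / total

def mean_user_accuracy(per_user):
    """Average of the per-user accuracies, giving each user equal weight."""
    return sum(c / t for c, t in per_user.values()) / len(per_user)

# User A contributes 1000 test gestures; users B and C contribute 10 each.
unbalanced = {"A": (900, 1000), "B": (5, 10), "C": (5, 10)}

pooled = pooled_accuracy(unbalanced)        # dominated by user A
per_user_mean = mean_user_accuracy(unbalanced)
```

Here the pooled accuracy stays close to the 0.90 of user A even though the other two users reach only 0.50, whereas the equally weighted mean is much lower.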
In five SPS (SPS 2, SPS 5, SPS 8, SPS 9, and SPS 18), the evaluation was performed with data acquired without feedback (i.e., the correctness of classification was not provided in the evaluation), so people could not adjust their movements to the HGR model. In eight SPS, the evaluation was performed with data acquired with feedback from the HGR model (SPS 1, SPS 4, SPS 11, SPS 12, SPS 13, SPS 17, SPS 20, and SPS 29), and the remaining SPS do not give information about feedback. Table 13 shows the recognition accuracies, the number of people who participated in the evaluation, the type of data set (i.e., balanced or unbalanced), and the use of cross-validation by the 56 HGR models. The largest number of people is 80 (SPS 23). Three HGR models were evaluated using EMG data from amputees (SPS 6, SPS 21, and SPS 48). Moreover, 19 HGR models use cross-validation, a technique used to reduce the probability of biased results in small data sets (see Table 13).
Table 13. The accuracy, number of people who participated in the evaluation, type of data set (i.e., balanced or unbalanced), and the use of cross-validation by the 56 HGR models.
Table 14).

Metric: Description
Throughput: Ratio between the index of difficulty and the movement time, which is the time (in seconds) [107].
Path Efficiency: Ratio between the straight line distance and the actual distance traveled [107,126].
Overshoot: Ratio between overshoots and number of targets. The ability to stop on a target [107,126].
Average Speed: Average nonzero speed of the cursor over the course of the trial [107,126].
Completion Rate: Ratio between the completed trials and the number of trials within the allowed time (i.e., trial time) [50,126].
Stopping Distance: Total distance traveled (path length) during the dwell time [108].
Completion Time: Time from movement initiation to the completion of the trial [31].
Real-time Accuracy: Ratio between correct predictions and number of predictions during the completion time [127].
Length Error: Ratio between distance beyond the total required distance, and the total required distance [31].
Reaction Time: Time from a target appearance and the first move of the cursor/virtual prosthesis [113].
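Three of the ratio metrics listed above can be sketched in Python as follows. The function names, the cursor path, and the trial counts are hypothetical; in particular, taking the straight-line distance between the first and last recorded cursor positions is an assumption, since the reviewed studies may measure it from the start point to the target.

```python
from math import dist

def path_efficiency(path):
    """Straight-line distance over the distance actually traveled.

    path is a list of cursor positions, e.g. [(x0, y0), (x1, y1), ...].
    """
    straight = dist(path[0], path[-1])
    traveled = sum(dist(p, q) for p, q in zip(path, path[1:]))
    return straight / traveled

def completion_rate(completed_trials, total_trials):
    """Trials completed within the trial time over all trials."""
    return completed_trials / total_trials

def overshoot(overshoots, targets):
    """Number of overshoots per target."""
    return overshoots / targets

# A cursor that moves right and then up instead of going diagonally.
example_path = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
efficiency = path_efficiency(example_path)   # 5 units straight vs 7 units traveled
```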
A motion test was proposed for patients with targeted muscle reinnervation to evaluate their myoelectric capacity [128]. In this test, the patients should maintain a gesture until the HGR model has made a predetermined number of correct predictions. In TAC, the patients control a virtual prosthesis to reach a target and hold it for a dwell time, which is generally 1 s [50]. The patients must reach the target within a trial time, which is generally 15 s. FLT is a similar test to TAC, but the users control a circular cursor with two or three degrees of freedom. FLT builds on the trade-off between speed and accuracy [51,108], which is defined by:
$$MT = a + b \cdot ID \qquad (22)$$
where MT is the movement time, a and b are empirical constants, and ID is the index of difficulty of a target (see Equation (23)), which is calculated using the distance (D) from an initial point to a target, and the width (W) of the target. Throughput is a metric proposed by Fitts, which is the ratio between the ID and MT (see Equation (24)), to summarize the performance of a control system. The results of FLT are reliable when this test combines a variety of IDs [129].
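As a rough illustration of these quantities, the following Python sketch computes the index of difficulty, the predicted movement time, and the throughput. The Shannon formulation ID = log2(D/W + 1) is an assumption, since the text only states that ID is calculated from D and W; the function names and the numeric values are likewise hypothetical.

```python
from math import log2

def index_of_difficulty(distance, width):
    """Index of difficulty in bits (Shannon formulation, assumed here)."""
    return log2(distance / width + 1)

def movement_time(a, b, distance, width):
    """Movement time predicted by MT = a + b * ID for empirical constants a and b."""
    return a + b * index_of_difficulty(distance, width)

def throughput(distance, width, measured_time):
    """Ratio between the index of difficulty and the measured movement time (bits/s)."""
    return index_of_difficulty(distance, width) / measured_time

# A target 7 units away and 1 unit wide, reached in 1.5 s.
target_id = index_of_difficulty(7, 1)       # log2(8) bits
target_throughput = throughput(7, 1, 1.5)
```

Evaluating several targets with different distances and widths, rather than a single one, follows the remark above that the test should combine a variety of IDs.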
The people who participated in these tests received feedback (i.e., the correctness of classification was provided in the evaluation). Four out of these nine HGR models were evaluated with amputees: four amputees (SPS 63), two amputees (SPS 59 and SPS 64), and one amputee (SPS 65).
To reach conclusive results, it is necessary to consider the sample size, which is the number of people who participated in the evaluation ($n_1$) (see Table 13) times the number of gestures per person ($n_2$) (see Table 11), so that the results are statistically significant. Using the typical values of a statistical hypothesis test (confidence level of 95%, margin of error of 5%, and population proportion of 50%), we estimated $n_1$ according to the normal distribution using the Central Limit Theorem (25), and $n_2$ according to Hoeffding's inequality (26), which is widely used in machine learning theory.
where $z$ is the critical value of the normal distribution for a confidence level of 95%, $\epsilon$ is the margin of error, $p$ is the population proportion, and $\alpha$ is the confidence level. Therefore, the sample size of the test set ($n_1 \times n_2$ gestures) must be in the order of hundreds of thousands. None of the works presented so far considered these values to achieve a significant result. In the scientific literature, many EMG data sets are available [130], but, to the best of our knowledge, the highest $n_1$ among them is 30 [131], and the highest $n_2$ is 40 [84,132].
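The order of magnitude stated above can be reproduced with a short Python sketch. It assumes the textbook forms of both estimates, namely $n_1 \geq z^2 p (1 - p) / \epsilon^2$ for the normal approximation and $n_2 \geq \ln(2 / (1 - \alpha)) / (2 \epsilon^2)$ for the two-sided Hoeffding bound; these forms and the function names are assumptions and may differ in detail from Equations (25) and (26).

```python
from math import ceil, log
from statistics import NormalDist

def people_needed(confidence=0.95, margin=0.05, proportion=0.5):
    """Number of people from the normal approximation (assumed form of the estimate)."""
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)

def gestures_needed(confidence=0.95, margin=0.05):
    """Gestures per person from the two-sided Hoeffding bound (assumed form).

    Solves 2 * exp(-2 * n * margin**2) <= 1 - confidence for n.
    """
    return ceil(log(2 / (1 - confidence)) / (2 * margin ** 2))

n1 = people_needed()
n2 = gestures_needed()
test_set_size = n1 * n2
```

Under these assumptions the sketch gives 385 people and 738 gestures per person, that is, 284,130 test gestures, which agrees with the order of hundreds of thousands stated above.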

Conclusions
This SLR analyzes works that propose HGR models using surface EMG and ML. Following the Kitchenham methodology, we introduced four RQs based on the main goal of this SLR, which was to analyze the state-of-the-art of these models. To answer these four RQs, we presented, analyzed, and discussed the data extracted from 65 selected primary studies. Below are our findings in regard to the four RQs.
Structure: The structure of the models studied varies from one work to another. However, we were able to examine these models using a common structure composed of six stages: data acquisition, segmentation, preprocessing, feature extraction, classification, and postprocessing. Under this standard structure, we studied the types of HGR models, the number of EMG sensors, the sampling rate, sensors, segmentation and preprocessing techniques, extracted features, the domain of the extracted features, and the ML algorithm. The most used structure is: eight EMG sensors, a sampling rate between 200 Hz and 1000 Hz, overlapping sliding windowing (segmentation), filtering (preprocessing), mean absolute value (feature extraction), and support vector machines and feedforward neural networks (classification).
Controller delay and hardware: The controller delay of gesture recognition models is the sum of two values: data collection time (DCT) and data analysis time (DAT). A recognition model works in real-time when this sum is less than an optimal controller delay. However, the works analyzed report several optimal controller delays for different applications, suggesting that the optimal controller delay is relative to the user perception and the application of a recognition model.
Number and types of gestures recognized: The 65 works analyzed propose models that recognize different numbers and types of gestures: 31 works took into consideration the rest gesture as a class to be recognized; only one model recognized both static and dynamic gestures; and the remaining models recognized static gestures only. No model recognized dynamic gestures only, as most of the EMG data generated by dynamic gestures are in the transient state. Recognizing gestures using EMG data in the transient state is more complex than in the steady state because the former behaves as a non-stationary process. The classification of the hand gestures using EMG data in the steady state is more accurate than in the transient state, and only nine works recognized short-term gestures (i.e., using EMG data in the transient state).
Metrics and results: We divided the SPS according to the types of evaluation, which are machine-learning metrics and target achievement tests. Fifty-six SPS evaluated their models using machine-learning metrics. We found 13 machine-learning metrics and three target achievement tests. The training and testing protocols vary among the works, making the comparison of their performance very difficult. Moreover, taking into consideration that many works do not describe these protocols and the whole structure of the model, one key point is the significance and reproducibility of the results. Using the normal distribution for the number of people, and Hoeffding's inequality for the number of gestures per person, we estimated that the sample size of the test set must be in the order of hundreds of thousands to obtain a result with a confidence level of 95% and a precision of 5%. None of the works analyzed use a test set of this magnitude, and therefore the confidence and reproducibility of their results are questionable. Based on the definition of a true positive, only one of the HGR models that used machine-learning metrics was evaluated using the recognition accuracy; the remaining models were evaluated using classification accuracy, as they only took into consideration what gesture was performed by a person as a true positive.

Future Work
Based on this SLR, we identify the following possible directions for future work in this field:
• Research the optimal permitted delay to determine a general criterion of real-time processing in HGR models using EMG and ML.
• Develop models using EMG and ML to recognize gestures of long and short duration. These models must therefore be able to recognize gestures using EMG data in the transient and steady states.
• Develop evaluation methods for the HGR models using EMG and ML that state the test sets, metrics, and protocol of evaluation.
• Develop general HGR models using EMG and ML that can be used by people who do not participate in the training of these models.
• Develop recognition models that recognize not only a single gesture but also a sequence of movements.
Funding: This research was funded by Escuela Politécnica Nacional through the research project PIJ-16-13.

Acknowledgments: The authors gratefully acknowledge the financial support provided by Escuela Politécnica Nacional for the development of the research project PIJ-16-13. We also thank Marco Segura, Carlos Anchundia, Patricio Zambrano, and Jonathan Zea for comments that greatly improved this manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: