Surface EMG Statistical and Performance Analysis of Targeted-Muscle-Reinnervated (TMR) Transhumeral Prosthesis Users in Home and Laboratory Settings

Pattern-recognition (PR)-based myoelectric control is a leading trend in the development of future prostheses. Compared with conventional prosthetic control systems, PR-based control systems provide high dexterity, with many studies achieving >95% accuracy in the last two decades. However, most research studies have been conducted in the laboratory. There is limited research investigating how EMG signals are acquired when users operate PR-based systems in their home and community environments. This study compares the statistical properties of surface electromyography (sEMG) signals used to calibrate prostheses and quantifies the quality of calibration sEMG data through separability indices, repeatability indices, and correlation coefficients in home and laboratory settings. The results demonstrate no significant differences in classification performance between home and laboratory environments in within-calibration classification error (home: 6.33 ± 2.13%, laboratory: 7.57 ± 3.44%). However, between-calibration classification errors (home: 40.61 ± 9.19%, laboratory: 44.98 ± 12.15%) were statistically different. Furthermore, the difference in all statistical properties of sEMG signals is significant (p < 0.05). Separability indices reveal that motion classes are more diverse in the home setting. In summary, differences in sEMG signals generated between home and laboratory only affect between-calibration performance.


Introduction
Limb amputation refers to the removal of all or part of an upper or lower extremity. When people lose their upper limbs, many activities of daily living are significantly limited, as they interact with their surroundings and perform sophisticated tasks with their hands. According to hand and upper-limb reconstruction statistics provided by the NHS [1], the total number of amputations in the United Kingdom is estimated to be 250,000, increasing by 10,000 per year. One out of four people with limb loss is an upper-limb amputee.
Prostheses aim to replace lost limbs and restore functionality. Myoelectrically controlled prostheses are state-of-the-art devices that intuitively interpret muscle signals to control the device. Control schemes for myoelectrically controlled prostheses have evolved from the initial on-off control to the two most popular methods, namely proportional amplitude control and pattern-recognition-based control [2]. Conventional proportional control schemes with two electrodes control the prosthesis with one degree of freedom and vary the control voltage according to the amplitude of the sEMG signals, providing robust performance but limited functionality [3]. Both control schemes can provide reasonable controllability for prostheses. Despite advancements in myoelectric control of prostheses, the prosthetic abandonment rate has not changed significantly since
Surface EMG signals were obtained from [21], acquired from eight targeted-muscle-reinnervated (TMR) transhumeral amputees with experience using myoelectric prostheses, over six to eight weeks at home and in the laboratory. However, only data for seven participants were available to us; one contained no home trial data, and another had a failed channel. Hence, we used the data of five participants in this study. The participants used custom-fabricated prostheses comprising a Boston Digital Elbow (Liberating Technologies Inc., Holliston, Massachusetts, USA), a Motion Control Wrist Rotator (Motion Control Inc., Salt Lake City, Utah, USA), and a single-degree-of-freedom terminal device (a powered split hook or hand). The prostheses were embedded with eight stainless-steel electrodes sampling at 1000 Hz. These eight electrodes were grid-arranged [22] and placed on the wall of the prosthesis liner.
Before and after the home trial, several tests were performed in the laboratory to evaluate the prosthetic control performance of each participant. The goal was to identify optimal electrode sites inside the socket and make the amputee confident about using the device. The user was then included in the trial and sent home with the device. During the home trial, participants were instructed to control the prosthesis to perform activities of daily living and to record the use frequency and activities performed using the prosthesis. Calibration sessions at home were at the discretion of the participants. They could calibrate after donning or any time they noticed a decrease in performance. On the other hand, laboratory calibration sessions were conducted as instructed by the occupational therapist during laboratory visits throughout the trial.
In each calibration, seven movements were recorded: elbow flexion, elbow extension, wrist pronation, wrist supination, hand open, chuck grip, and rest. Except for rest, each calibration motion was supposed to be performed twice, lasting three seconds each. After each calibration, sEMG signal data were stored in the memory of the embedded controller so that prosthesis usage data could be accessed after the home or laboratory trial. We used the calibration data of the whole 6-8 weeks of home and laboratory trials. Table 1 shows the number of calibrations for each participant. In addition, because the number of calibrations differs between the laboratory and home, we chose an equal number of calibrations for the laboratory and home settings based on the side with fewer calibrations. We balanced the number of laboratory calibrations taken before and after the home trial. The selected data were as close in time as possible to minimize the effect of time, which could cause different body conditions, as well as familiarity with control of the prosthesis, resulting in different EMG signals.

Statistical Properties Calculation
We described the raw sEMG signals using the following statistical properties to understand how the signals differ between home and the laboratory. We then averaged each calculated statistical property over all channels and motions for each calibration of each participant.

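1. Root Mean Square (RMS)
$$\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}} \tag{1}$$
where $n$ is the number of samples, and $x_i$ is the amplitude of sample $i$.

2. Mean Frequency (MeanF) [23]
$$\mathrm{MeanF} = \frac{\sum_{i=1}^{M} f_i P_i}{\sum_{i=1}^{M} P_i} \tag{2}$$
where $M$ is the number of frequency bins, $f_i$ is the frequency of the spectrum at bin $i$, and $P_i$ is the power spectrum at bin $i$.

3. Median Frequency (MedF) [23]
$$\sum_{i=1}^{\mathrm{MedF}} P_i = \sum_{i=\mathrm{MedF}}^{M} P_i = \frac{1}{2}\sum_{i=1}^{M} P_i \tag{3}$$
where $P_i$ is the power spectrum at bin $i$, and $M$ is the number of frequency bins. The total power spectrum is divided into two equal parts at the median frequency.

4. Variance
$$\mathrm{VAR} = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2} \tag{4}$$
where $x_i$ is the amplitude of the signal at sample point $i$, $\bar{x}$ is the mean amplitude of the sEMG signal, and $n$ is the number of samples.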
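The four properties can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' MATLAB code: the function name `semg_statistics` is hypothetical, the power spectrum is assumed to be a plain one-sided periodogram, the variance is assumed to be the sample variance, and the 1000 Hz default follows the sampling rate stated earlier.

```python
import numpy as np

def semg_statistics(x, fs=1000.0):
    """Return RMS, variance, mean frequency and median frequency of one channel."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    var = np.var(x, ddof=1)  # sample variance (assumed normalisation)

    # One-sided power spectrum (periodogram) and the frequency of each bin in Hz.
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)

    # Mean frequency: power-weighted average of the bin frequencies.
    mean_f = np.sum(freqs * power) / np.sum(power)

    # Median frequency: first bin at which cumulative power reaches half the total.
    cum_power = np.cumsum(power)
    med_f = freqs[np.searchsorted(cum_power, cum_power[-1] / 2.0)]

    return {"rms": rms, "var": var, "mean_f": mean_f, "med_f": med_f}

# Usage: a 3 s, 100 Hz sine sampled at 1000 Hz stands in for one channel.
t = np.arange(3000) / 1000.0
stats = semg_statistics(np.sin(2 * np.pi * 100.0 * t))
```

In the study these values are averaged over channels and motions per calibration; that aggregation step is omitted here.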

Signal Processing and Feature Extraction
The obtained sEMG signals were processed using MATLAB R2020b. We filtered the EMG signals between 20 and 500 Hz using a fourth-order Butterworth filter. Subsequently, the filtered signals were segmented using overlapping windows of 200 ms with 30 ms increments. Hudgins' feature set [24] together with the Willison amplitude (WAMP) was extracted from each window.
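The segmentation and feature extraction steps can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the feature set is assumed to consist of mean absolute value (MAV), zero crossings (ZC), slope sign changes (SSC), and waveform length (WL) plus WAMP, the function names are hypothetical, the threshold `thr` is a placeholder, and the band-pass filtering step is not shown.

```python
import numpy as np

def sliding_windows(x, fs=1000, win_ms=200, step_ms=30):
    """Split a 1-D signal into overlapping windows, one window per row."""
    win = int(fs * win_ms / 1000)
    step = int(fs * step_ms / 1000)
    starts = range(0, len(x) - win + 1, step)
    return np.array([x[s:s + win] for s in starts])

def td_features(w, thr=0.0):
    """Time-domain features of one window: MAV, ZC, SSC, WL, WAMP."""
    w = np.asarray(w, dtype=float)
    d = np.diff(w)
    mav = np.mean(np.abs(w))
    wl = np.sum(np.abs(d))
    # Zero crossing: consecutive samples of opposite sign with a step above thr.
    zc = np.sum((w[:-1] * w[1:] < 0) & (np.abs(d) > thr))
    # Slope sign change: consecutive differences of opposite sign, one above thr.
    ssc = np.sum((d[:-1] * d[1:] < 0)
                 & ((np.abs(d[:-1]) > thr) | (np.abs(d[1:]) > thr)))
    # Willison amplitude: number of sample-to-sample steps above thr.
    wamp = np.sum(np.abs(d) > thr)
    return np.array([mav, zc, ssc, wl, wamp], dtype=float)

# Usage: 3 s of one channel at 1000 Hz gives 94 windows of 200 samples.
windows = sliding_windows(np.zeros(3000))
features = np.array([td_features(w) for w in windows])
```

Per-channel feature vectors would then be concatenated across the eight channels before classification.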

Calibration Quality Quantification
In our previous research [25], we demonstrated that quantifying feature change can effectively reflect how sEMG signals change over time. Hence, quantifying the feature space variation could be critical to evaluating changes in calibration data. We tested four separability indices, one repeatability index, and two correlation coefficients as signal quality quantification metrics.

Separability Indices
Separability indices between each motion were used to measure the diversity of each motion pattern in feature space based on statistical criteria for each calibration. These separability indices combine within- and between-class information to describe the classifiability of calibration data. Because some methods evaluate the separability between two classes, we calculated these indices between each pair of motion classes (i.e., there were K two-class combinations) and averaged them for single calibration data. In this study, we used the following four separability indices:
• Davies-Bouldin index (DBI) [26]
The DBI measures the worst-case separability of neighbouring classes in feature space by averaging the highest magnitude of overlap among them. Hence, a lower value of DBI indicates higher class separability. It is computed as in Equations (5)-(7), where $S_h$ is the diversity of features within a class, $C_h$ is the $h$th class, $C_l$ is the $l$th class ($C_h \neq C_l$), $N_h$ is the number of feature vectors in the $h$th class, $x_i$ is the $i$th feature vector in the $h$th class, $D_{hl}$ is the similarity between classes, $\mu_h$ is the mean of the feature vectors in the $h$th class, $R_{hl}$ combines $D_{hl}$ and $S_h$ to measure the overlap between two classes, and $K$ is the number of pairs of classes.
• Simplified Silhouette value (SS) [27]
SS is a computationally efficient version of the silhouette value. It analyses the consistency of each point in its class and the diversity of each point from other classes. Summarizing the SS of all data points enables determination of the level of separability between two classes. The range of SS is −1 to 1, with −1 representing the worst separability and 1 representing the best separability. It is computed as in Equations (8) and (9), where $a(i)$ is the distance between a feature vector ($x_i$) and the centroid of its own class, $b(i)$ is the distance of $x_i$ to the centroid of the other class, $ss(i)$ is the SS of a single feature vector, and $N_h$ is the number of feature vectors in the $h$th class.
• Fisher's linear discriminant analysis index (FLDI) [28]
FLDI can be applied to a multiclass problem; it is the ratio between the between-class and within-class scatter matrices, as shown in Equations (10)-(12). A larger FLDI implies greater separability. In these equations, $S_b$ is the between-class scatter matrix, $S_w$ is the within-class scatter matrix, $c$ is the number of classes, $N_i$ is the number of feature vectors in the $i$th class, $\mu_i$ is the mean feature vector in the $i$th class, $\mu$ is the mean of all classes, and $x_{ij}$ is the $j$th feature vector in the $i$th class.
• Separability index (SI) [29]
The SI measures the distance between the centroid of the ellipse of each class and the nearest class, averaged across all motion classes, as formulated in Equation (13). The higher the SI, the more separability there is between classes. In Equation (13), $N$ is the number of motion classes; $\mu_i$ and $\mu_j$ are the centroids of the $i$th and $j$th classes, respectively; and $S_i^{-1}$ is the inverse of the covariance matrix of the $i$th class.
Sensors 2022, 22, 9849
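The four indices described above can be sketched as follows. The forms used here are assumptions chosen to be consistent with the symbol definitions: Euclidean distances for DBI and SS, a trace ratio of the scatter matrices for FLDI (other scalarisations of the ratio exist), and half the Mahalanobis distance to the nearest class for SI. All function names are hypothetical, and this is not the authors' code.

```python
import numpy as np
from itertools import combinations

def dbi_pair(a, b):
    """Overlap R_hl = (S_h + S_l) / D_hl of two classes; lower means more separable."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    s_a = np.mean(np.linalg.norm(a - mu_a, axis=1))  # within-class spread S_h
    s_b = np.mean(np.linalg.norm(b - mu_b, axis=1))  # within-class spread S_l
    return (s_a + s_b) / np.linalg.norm(mu_a - mu_b)

def ss_pair(a, b):
    """Simplified silhouette of two classes, using distances to class centroids."""
    scores = []
    for own, other in ((a, b), (b, a)):
        d_own = np.linalg.norm(own - own.mean(axis=0), axis=1)      # a(i)
        d_other = np.linalg.norm(own - other.mean(axis=0), axis=1)  # b(i)
        scores.append((d_other - d_own) / np.maximum(d_own, d_other))
    return float(np.mean(np.concatenate(scores)))

def fldi(classes):
    """Assumed scalar form: trace(S_b) / trace(S_w)."""
    mu = np.vstack(classes).mean(axis=0)
    s_b = sum(len(c) * np.outer(c.mean(axis=0) - mu, c.mean(axis=0) - mu)
              for c in classes)
    s_w = sum((c - c.mean(axis=0)).T @ (c - c.mean(axis=0)) for c in classes)
    return float(np.trace(s_b) / np.trace(s_w))

def separability_index(classes):
    """Mean over classes of half the Mahalanobis distance to the nearest class."""
    mus = [c.mean(axis=0) for c in classes]
    # Requires non-singular class covariances.
    invs = [np.linalg.inv(np.cov(c, rowvar=False)) for c in classes]
    nearest = []
    for i in range(len(classes)):
        dists = [0.5 * np.sqrt((mus[i] - mus[j]) @ invs[i] @ (mus[i] - mus[j]))
                 for j in range(len(classes)) if j != i]
        nearest.append(min(dists))
    return float(np.mean(nearest))

def pairwise_mean(classes, fn):
    """Average a two-class index over all K class pairs."""
    return float(np.mean([fn(a, b) for a, b in combinations(classes, 2)]))

# Usage: two well-separated toy classes in a 2-D feature space.
a = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
b = a + np.array([10.0, 0.0])
dbi = pairwise_mean([a, b], dbi_pair)
```

Averaging the two-class indices over all pairs mirrors the pairwise procedure described above; FLDI and SI operate on all classes at once.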

Repeatability Index and Correlation Coefficients
To investigate the performance of a trained classifier on other calibration data, we calculated the repeatability index and correlation coefficients between training and testing calibration data. The change in feature space distribution can reflect the temporal and spatial variation in EMG signals [25]. Therefore, the selected correlation coefficients are primarily used to determine whether the distributions differ, indicating the consistency of the calibrations. We concatenated all channels for each motion to obtain each feature space's kernel-smoothed probability density functions (PDFs). Subsequently, correlation coefficients were calculated based on the PDFs. Equations (14)-(16) show how these values were calculated. Because correlation coefficients are computed between two single-feature distributions, we averaged them over features and motions.

• Repeatability index (RI) [29]
The RI was previously explored in [29,30]. Both studies showed that the RI is an effective index of the consistency of EMG motion patterns in feature space generated in different trials. The RI is calculated as the distance between the centroid of the ellipse of a class in one calibration and the same class in another calibration, averaged over all motion classes, as formulated in Equation (14), where $N$ is the number of motion classes; $\mu_{Tr_i}$ and $\mu_{Ts_i}$ are the centroids of the $i$th training and testing class, respectively; and $S_i^{-1}$ is the inverse of the covariance matrix of the $i$th training class. A lower RI indicates more consistency between training and testing data.
• Two-Sample Kolmogorov-Smirnov Test statistics (K-S) [31]
K-S provides information on the similarity between two distributions, as formulated in Equation (15), where $F_1(\cdot)$ and $F_2(\cdot)$ are the cumulative distribution functions of two feature distributions, $M$ is the number of features in the feature space, and $N$ is the number of motion classes. Data from training and testing tend to be well-correlated when the K-S is low.
• Spearman correlations (rho) [32]
Rho measures how two distributions are monotonically related, as expressed in Equation (16), where $d$ is the rank difference between the two ranks of each probability density, and $n$ is the number of probability densities. A rho of −1 indicates that two feature distributions are totally different, whereas 1 represents the highest similarity between two feature distributions.
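The three consistency metrics can be sketched with NumPy alone. The forms are assumptions consistent with the definitions above: half the Mahalanobis distance between matching centroids for RI, the maximum gap between empirical cumulative distributions for K-S, and the Pearson correlation of ranks (equivalent to the rank-difference formula when there are no ties) for rho. The Gaussian kernel with a Silverman-style bandwidth is an assumed stand-in for the kernel smoothing, and all function names are hypothetical.

```python
import numpy as np

def repeatability_index(train_classes, test_classes):
    """Mean half Mahalanobis distance between matching train/test class centroids."""
    vals = []
    for tr, ts in zip(train_classes, test_classes):
        diff = tr.mean(axis=0) - ts.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(tr, rowvar=False))  # training covariance
        vals.append(0.5 * np.sqrt(diff @ inv_cov @ diff))
    return float(np.mean(vals))

def ks_statistic(u, v):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.concatenate([u, v])
    cdf_u = np.searchsorted(u, grid, side="right") / u.size
    cdf_v = np.searchsorted(v, grid, side="right") / v.size
    return float(np.max(np.abs(cdf_u - cdf_v)))

def gaussian_kde_pdf(samples, grid):
    """Kernel-smoothed PDF on a grid (Gaussian kernel, assumed bandwidth rule)."""
    samples = np.asarray(samples, dtype=float)
    h = 1.06 * np.std(samples, ddof=1) * samples.size ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2 * np.pi))

def average_ranks(a):
    """1-based ranks, with tied values sharing their average rank."""
    _, inverse, counts = np.unique(np.ravel(a), return_inverse=True,
                                   return_counts=True)
    ends = np.cumsum(counts)
    return (ends - (counts - 1) / 2.0)[inverse]

def spearman_rho(p, q):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(p), average_ranks(q))[0, 1])

# Usage: compare one feature's distribution in a training and a testing calibration.
grid = np.linspace(-10.0, 14.0, 2401)
pdf_train = gaussian_kde_pdf([0.0, 1.0, 2.0, 3.0, 4.0], grid)
pdf_test = gaussian_kde_pdf([0.5, 1.5, 2.5, 3.5, 4.5], grid)
rho = spearman_rho(pdf_train, pdf_test)
```

In the study the K-S and rho values are averaged over features and motions; that loop is left out for brevity.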

Data Analysis
Linear discriminant analysis (LDA) was selected as the classifier. Classification was divided into two parts. In the first part, called within-calibration classification (WCC), we used an eightfold cross-validation procedure to evaluate how the classifier performed when trained and tested within the same calibration. The second part estimated the between-calibration classification (BCC) performance using the leave-one-calibration-out cross-validation method. To determine whether there are statistically significant differences in classification performance between home and laboratory calibration data, we performed sign tests on both WCC and BCC errors. Furthermore, we applied linear regression with each separability index as the independent variable against WCC errors.
Similarly, linear regression was used with the repeatability index and each correlation coefficient as independent variables against BCC errors. The linearity between these indices and classification errors was represented by the p-value and R-squared value of each linear model to determine whether the indices reasonably describe calibration data viability. We used the sign test to determine statistical differences between home and laboratory settings for each evaluation metric.
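The two statistical tools named above can be sketched as follows. This is a minimal sketch, assuming a two-sided paired sign test with ties dropped and an ordinary least-squares line; the function names and the toy numbers are hypothetical, and the p-value of the regression slope is not computed here.

```python
from math import comb

import numpy as np

def sign_test(x, y):
    """Two-sided sign test p-value for paired samples; tied pairs are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)  # number of positive differences
    # Under the null hypothesis, k follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R-squared value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return float(slope), float(intercept), float(r2)

# Usage with made-up numbers: paired errors, then an index regressed on error.
p_value = sign_test([7.1, 8.0, 6.5, 9.2, 7.7, 8.4], [6.0, 6.9, 6.1, 7.5, 6.8, 7.0])
slope, intercept, r2 = linear_fit_r2([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The sign test makes no assumption about the shape of the error distribution, which suits a small number of paired calibrations.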

Statistical Properties and Classification
The four statistical properties of sEMG from the home and laboratory settings for each participant are shown in Figure 2. The sign test revealed a significant difference in the RMS and the variance of sEMG, which were both larger in the laboratory than at home (p < 0.001). The mean and median frequencies were greater at home than in the laboratory (p < 0.001). The sign test results for calibrations of all participants between home and laboratory are summarized in Table 2.

WCC and BCC errors are presented in Table 3. All BCC errors are larger than those of WCC, with the lowest error of 28.40 ± 4.91% for BCC and 5.61 ± 1.55% for WCC. The overall absolute value of the global mean WCC and BCC errors in the laboratory is higher than at home, although only BCC showed a significant difference (p < 0.05).

Metrics for Calibration Quality Quantification
For all metrics used to quantify the quality of signals, the line-fitting results across metrics and classification errors from all participants are summarized in Table 4. Figures 3 and 4 show examples of how we fitted WCC with DBI and BCC with RI into linear regression models.
Table 4. This table illustrates whether linearity exists (1) between separability indices and WCC errors and (2) between repeatability, correlation coefficient, and BCC errors. All R squares have p < 0.05, except for K-S in the laboratory. CC is the correlation coefficient. p-value indicates whether there are significant differences between the home and laboratory settings for each metric (bold-faced).

All separability indices have a high degree of linear relationship with WCC errors in home and lab contexts. WCC errors are lower with lower DBI and higher SS, FLDI, and SI. Additionally, RI has a linear relationship with BCC errors (higher RI with higher BCC errors) in home and lab calibration data. In contrast, K-S and rho have no and low linearity with BCC error in lab calibration data, respectively. Based on the averaged index values across all calibrations and the sign test on all metrics, only DBI and SI indicate that home calibrations have better separability than laboratory calibrations.

Discussion
The aim of this study was to compare the calibration of sEMG signals between home and laboratory settings through analysis of the statistical properties of sEMG signals and to quantify the calibration quality in both contexts. The overall results show a better calibration quality at home than in the laboratory. In sEMG signals, RMS is related to the contraction forces, and variance represents sEMG signal power. Statistical analysis results show that there is a significant difference between home and laboratory settings, which suggests that contraction levels vary between the two contexts. Because it is difficult for amputees to consistently produce contraction levels without proprioceptive and visual feedback [33], the force used to calibrate prostheses can vary each time. In the laboratory, amputees might have been more focused (i.e., high motivation or awareness) on performing motions, which resulted in high RMS and variance values. In addition, intensive concentration can lead to mental fatigue, which alters the recruitment of muscle fibers when generating the same force and motion pattern [34] and thereby influences the consistency of the EMG signal. On the other hand, contraction levels could be estimated by MedF and MeanF, but the estimation is affected by the type of contraction, the subject, and the muscle length [35]. MedF and MeanF are the gold standards for assessing muscle fatigue using surface EMG signals because muscle fatigue results in a downward frequency shift [23]. Given the significant differences between home and laboratory settings in MedF and MeanF, muscle fatigue could potentially occur when participants calibrate their prostheses in the laboratory.
The WCC performance with the selected classifier and feature set obtained promising results with 6-8 weeks of home trial and lab calibration data. However, from the perspective of overall mean errors, the WCC errors in the lab are slightly higher than those in the home, despite no significant difference in the statistical test. In a study conducted by Waris et al. [8], LDA showed better performance and robustness than conventional classifiers on a fluctuating sEMG signal over seven days. Hence, the potential reason for the lack of difference in the WCC could be that the LDA and selected feature sets are robust to the divergence of sEMG between home and laboratory settings. During home-trial recording, signal noise and user timing issues could be the main reasons for low-quality signals at home [19]. Signal noise issues include impedance change (when the skin's temperature rises and sweat starts to form), intermittent electrode contact with the skin (due to muscle volume variation when performing contraction, socket movement, etc.), and poor wire condition. User timing issues include unexpected activity during resting, insufficient contraction time, and missed contractions. Like home calibrations, calibrations in the lab also contained signal noise issues and timing issues, even under supervision. Figure 5 shows a raw sEMG signal from the laboratory. In addition, we found that a large proportion of laboratory calibrations had issues of insufficient contraction time, which mixed resting signals with other motions. A short contraction time results in a low diversity between motion patterns and reduced classifiability. Furthermore, we used the resting-based threshold for WAMP to improve class separability [36]. Spontaneous activity during resting causes the feature's threshold to fluctuate and introduces unknown motion into the signal.
On the other hand, BCC errors are much larger than WCC errors due to the stochastic characteristics of sEMG. Although we chose calibration data as close in time as possible, the time interval between two calibrations could be weeks, as the subjects calibrated the prosthesis at home for 6-8 weeks. Increasing time gaps between training and testing data deteriorate classification performance [5,8]. Except for TH04, all subjects had crossed home and laboratory trials, or the interval between them was not more than one week. TH04's lab trial was performed one month after the last calibration of the home trial. Because TH04 had not used the prosthesis for an extended period, he could not produce consistent motion patterns across different calibrations in the lab. As a result, TH04 had the highest BCC error and a considerable difference in WCC error between home and lab settings.
Among the metrics for calibration quality quantification, DBI had the highest R-squared value, followed by SS and FLDI. With a reasonable degree of linearity, it can be concluded that these three indices can be used as quality indices to assist a user in determining whether additional calibrations for prostheses are needed. Because the repeatability index and correlation coefficients reveal the consistency between two sets of calibration data, they may be used to compare new calibration data with historical data that have good motion patterns. Nathan et al. [37] developed a calibration quality feedback tool to increase the function of myoelectric prostheses. They used the separability index and repeatability index to evaluate calibration data with a rating system and advice for subsequent recalibration.
The results of this study are encouraging in terms of home use of myoelectric prostheses. However, the study is limited, as it only compares signals without considering contextual factors.

Conclusions
In this study, we adopted a dedicated methodological approach to assess the quality of data recorded at home during prosthesis use, data recorded in a laboratory setting, and how the two contexts affect performance. Results obtained in this study indicate that the within-calibration classification results of the sEMG of TMR amputees between home and laboratory settings did not significantly differ, but the quality of calibrations was different, with home data providing better separability. However, the between-calibration performance was better at home than in the laboratory despite no statistical difference in the repeatability metrics. These results show that although the motivation and engagement of patients might differ between home and laboratory settings, they have no significant influence on the within-calibration performance.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data analyzed in this study are available from Levi Hargrove.