1. Introduction
This paper presents a study on classifying phones (speech sounds) using electromyographic (EMG) signals obtained from the recently developed Spanish ReSSInt-EMG database. This database is part of the ReSSInt project [1], which aims to restore speech for laryngectomees using an EMG-based silent speech interface (SSI). Laryngectomees are individuals whose larynx (voice box) has been surgically removed; as a result, they are no longer able to produce speech naturally and depend on alternative methods to communicate verbally. There exist three main options for voice restoration after laryngectomy, namely esophageal, tracheoesophageal, and electrolaryngeal speech. However, each of these alternative speaking methods has limitations [2].
For this reason, important research efforts are dedicated to developing technological solutions to overcome those limitations. Technological approaches to restore speech for laryngectomees include personalized text-to-speech systems, voice conversion, bionic voices, lean-AI approaches, and SSIs, among others [3].
The ReSSInt project, of which the current study is part, aims to create a database and research the potential of developing an SSI for Spanish laryngectomees. Most of the works and databases related to SSIs [4,5,6] have been developed for English, and some exist for other languages [7,8,9,10,11]. However, none of these works focus on Spanish, and this project therefore intends to narrow that gap.
SSIs aim to convert non-acoustic biosignals into text or acoustic speech [12,13]. Biosignals refer to the product of the chemical, electrical, physical, and biological processes taking place during speech production, such as neural activity, articulator motor control, muscle activity, articulatory gestures, the vibration of the vocal folds, and pulmonary activity. Technologies to capture these biosignals include vocal tract imaging [14], magnetic tracing [15], electroencephalography [16], and EMG [17,18]. The conversion from these silent biosignals to audible speech can be done directly, using machine-learning algorithms that model the relationship between the feature vectors extracted from the biosignals and the acoustic signals [5,19], or indirectly, by first producing the related text [20,21,22] and then using a text-to-speech (TTS) model to generate synthetic speech.
The non-acoustic biosignals used in this work are EMG signals or, more specifically, surface (i.e., non-invasive) EMG signals [23]. Electromyography is a technique used to measure and record the electrical activity of muscles. When a muscle is active, it produces an electrical signal, called an action potential, that can be detected by an electrode placed on the skin over the muscle. Since this study is concerned with speech, we target muscles in the face and the neck.
In order to develop an EMG-to-speech SSI, a large database of EMG and speech data is required. The main idea is to obtain a model trained on large amounts of parallel EMG and speech data. To ensure the generalization capabilities of the models, it is important to use a diverse and representative dataset for training. However, the process of acquiring the data is complex and presents a number of difficulties.
Two prominent challenges in the development of these interfaces are the dependency of the trained models on the session (session dependency) and on the speaker (speaker dependency). Session dependency arises from the variations observed in the obtained EMG signals when electrodes are positioned differently on the subject’s face. Speaker dependency is due to differences in the way of speaking from person to person. Additionally, an important issue arises from inadequate adhesion of the electrodes to the skin, leading to the detachment of electrodes over time and the generation of noisy signals. As a consequence, long sessions are difficult to carry out, thus limiting the amount of data available per session.
EMG signals have previously been used to perform phone classification [24,25], syllable identification [9], word recognition [11,26], continuous speech recognition [27], speaker recognition [28,29], and direct speech generation [22,30,31,32]. In this study, we perform a set of phone classification experiments using data from different speakers and sessions. Classifying phones offers a straightforward means of gaining valuable insights into the information conveyed by each muscle involved in the speech production process, making it an advantageous task for evaluating the performance of an acquisition setup [33].
This work is an extension of the study presented in [34], which describes a set of experiments designed to validate the acquisition setup of the newly developed ReSSInt-EMG database. Using data from nine recording sessions, our previous study compared the performance of the new database with that of a comparable subset extracted from the well-known EMG-UKA Trial Corpus [4]. The results of the phone classification experiments performed on both databases confirmed the adequacy of the established data acquisition procedures. In this paper, we extend the experiments and analysis to newly acquired data and analyze the speaker and session dependency of the results, while at the same time improving the classification and feature-reduction methods.
This paper is structured as follows: Section 2 describes the data acquisition setup, including the recording procedure and the electrode setup, as well as the ReSSInt-EMG database, the feature extraction method, and the phone classification experiments. The results of the experiments are presented in Section 3 and then interpreted and discussed in Section 4.
2. Materials and Methods
This section provides a thorough description of our study’s methodology. Specifically, we detail the materials and procedures used to record the database and provide comprehensive information on its contents. We also describe the methodology employed to calculate the features extracted from the EMG signals. Finally, we describe the classification experiments conducted in this research.
2.1. Acquisition Setup
This section describes in detail the devices used to record the database, the methodology employed to identify reference points on participants’ faces, and our approach to mitigating inter-session variability. Additionally, we provide comprehensive information regarding the set of tracked muscles and outline the procedure used to select them.
2.1.1. Recording Procedure
Each session is recorded in a soundproof room using a silent computer in an attempt to reduce interference with the audio and EMG signals as much as possible. The EMG signals are recorded with a Quattrocento bio-electrical amplifier at a sampling frequency of 2048 Hz, and the voice is captured with a Neumann TLM103 large-diaphragm microphone at a sampling frequency of 16 kHz.
For the acquisition and synchronization of the audio and EMG signals, we use publicly available software (https://github.com/cognitive-systems-lab/EMG-GUI, accessed on 1 March 2022), which also includes a user interface. Additionally, a camera captures a video of the facial movements, which is meant to provide supplementary data and allow multi-modal experiments in the future, such as automatic lip reading. For this paper, the video data are not considered. See Figure 1 for a photo of the complete acquisition setup.
In order to reduce inter-session variability in audio and video as much as possible, the positions of the subject, microphone, and video camera are kept constant for all sessions. Furthermore, a personalized 3D mask (Figure 2) is used to ensure that the electrode locations remain constant throughout all sessions. Prior to the first session with each speaker, we locate the positions of the electrodes using reference points and a measuring tape. To give an example, to locate the risorius, or laughing muscle, we position the first electrode adjacent to the corner of the mouth and place the second electrode in the direction of the earlobe on the same side of the face. We mark three points: one on each outer side of both electrodes and one in the middle. This process is repeated for all eight electrode pairs, resulting in a total of 24 reference points. A 3D-printing professional then creates a 3D scan of the face and prints a mask with holes corresponding to the reference points. During subsequent sessions, we draw the points again on the subject's face using the holes and place the electrodes accordingly.
Prior to each recording session, speakers are instructed to articulate their speech slightly more than usual. A supervisor is always present in the room to ensure the correct pronunciation of the utterances.
2.1.2. Electrode Setup
Previous studies have employed various approaches to determining the optimal electrode setup, such as targeting muscles specifically [31,35,36,37,38], analyzing anatomical regions [20], and looking for patterns in a high-density electrode setup [39]. Knowing that an action potential travels along the muscle as a wave, the most appropriate way to use bipolar acquisition is to place the two electrodes longitudinally over the muscle. We decided to target muscles individually and performed a pilot study that consisted of targeting all superficial muscles in the face and neck to find the muscles that were most useful for the task. The final setup (see Figure 3) differs slightly from those used in the previously mentioned studies. The targeted muscles are the following (one channel each):
1. Levator labii superioris (channel 1)
2. Masseter (channel 2)
3. Risorius (channel 3)
4. Depressor labii inferioris (channel 4)
5. Zygomaticus major (channel 5)
6. Depressor anguli oris (channel 6)
7. Anterior belly of the digastric (channel 7)
8. Stylohyoid (channel 8)
2.2. The ReSSInt-EMG Database
Table 1 shows the details of the currently recorded sessions of the ReSSInt-EMG database, namely, 16 sessions in total from 6 different speakers. Note that the acquisition process is still ongoing and that the final database will be larger. The complete database also includes data from laryngectomees, since they are our final target users. However, the data from laryngectomees cannot be used for the phone classification experiments in this study since aligned audio signals are required in order to obtain labeled segments.
In each recording session, three different kinds of items are recorded: nonsense words containing vowel–consonant–vowel (VCV) structures, isolated words, and sentences. The sentences are taken from the Sharvard Corpus [40] and from a text corpus called Ahosyn that was developed to record TTS databases [41] (see Table 2).
For the current experiments, we only used the signals corresponding to the Sharvard and Ahosyn sentences and not those of the VCV combinations or isolated words. The number of Ahosyn sentences in each session is smaller than the number of Sharvard sentences because the Ahosyn sentences are generally longer.
Each session is split into 80% training and 20% testing data. During the recording process, utterances are presented in a unique and random order for each session. To ensure consistency, we assigned the final 20% of each set of sentences as the testing set prior to the experiment. This approach ensures that the time of recording within each session is unrelated to the train–test split, and the utterances designated as the testing set remain constant for each speaker.
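As an illustration, this split can be expressed as a small helper over the ordered sentence list of each corpus. This is a minimal sketch under the description above; the function name and utterance IDs are illustrative, not taken from the authors' code.

```python
def split_utterances(ordered_ids, test_fraction=0.2):
    """Reserve the final fraction of a sentence set as the test split.

    Because the split is defined over the fixed sentence list rather than
    the randomized recording order, the same utterances form the test set
    in every session of a speaker.
    """
    n_test = round(len(ordered_ids) * test_fraction)
    return ordered_ids[:-n_test], ordered_ids[-n_test:]

# Example (hypothetical IDs): 100 Sharvard sentences -> 80 train / 20 test.
train_ids, test_ids = split_utterances([f"sharvard_{i:03d}" for i in range(1, 101)])
```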
Each utterance is segmented at the phone level using the Montreal Forced Aligner [42]. The phonetic dictionary was created using the Aholab transcriber, which uses the Spanish SAMPA phone set comprising 29 phones. Initial and final silences were removed, while short pauses between words were considered in the classification experiments.
2.3. Feature Extraction
After removing the direct-current offsets from the EMG signals and normalizing them, five time-domain (TD) features are calculated, as proposed in [38]. Similar parameters with small variations have also been used in [25,43]. The procedure to obtain these TD features is described here for clarity.
First, the signal $x[n]$ is separated into two components: a low-frequency signal $w[n]$ and a high-frequency signal $p[n]$. To obtain the low-frequency signal $w[n]$, a double average of $x[n]$ is calculated using a nine-point window:

$$v[n] = \frac{1}{9} \sum_{k=-4}^{4} x[n+k], \qquad w[n] = \frac{1}{9} \sum_{k=-4}^{4} v[n+k]$$

Having calculated $w[n]$, we can then obtain the high-frequency signal $p[n]$ by subtracting $w[n]$ from $x[n]$:

$$p[n] = x[n] - w[n]$$

A rectified version $r[n]$ of the high-frequency signal is also obtained, given by:

$$r[n] = |p[n]|$$

Once $w[n]$, $p[n]$, and $r[n]$ are obtained, the set of five time-domain features of a frame is defined as follows:

$$\mathbf{f} = \mathrm{TD0} = \left[\bar{w},\; P_w,\; P_p,\; z_p,\; \bar{r}\right]$$

where $\bar{w}$ and $\bar{r}$ are the frame-based means of $w[n]$ and $r[n]$, $P_w$ and $P_p$ are the frame-based powers of $w[n]$ and $p[n]$, and $z_p$ is the frame-based zero-crossing rate of $p[n]$.

To provide the classifier with temporal context, a stacking filter concatenates the features of the $2k$ adjacent frames ($k$ on each side). Specifically, the stacked feature vector of the $j$-th frame, denoted by $S(\mathbf{f}_j, k)$, is given by:

$$S(\mathbf{f}_j, k) = \left[\mathbf{f}_{j-k}, \ldots, \mathbf{f}_{j}, \ldots, \mathbf{f}_{j+k}\right]$$

Here, $j$ is the index of the central frame (i.e., the frame intended to be classified). A stacking filter of $k = 15$ is chosen, combining a total of 31 frames.
Finally, the stacked TD0 vectors from all eight channels are combined into a single array, which serves as the input for the classifier.
We used a window with a duration of 25 ms and a frame shift of 5 ms to extract the EMG features. Since five TD features are calculated for each of the EMG channels, the length of the parameter vector assigned to each frame is calculated as

$$N_{\text{features}} = 5 \times N_{\text{channels}} \times (2k + 1),$$

which results in 1240 features for a stacking-filter width of $k = 15$ and $N_{\text{channels}} = 8$ channels.
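To make the feature pipeline concrete, the following sketch computes the five TD features and applies the stacking filter for a single channel. It assumes the signal has already been offset-corrected and normalized; the function names and exact frame handling are illustrative rather than the authors' implementation.

```python
import numpy as np

def td_features(x, frame_len, frame_shift):
    """Five time-domain features per frame for one normalized EMG channel."""
    kernel = np.ones(9) / 9.0
    v = np.convolve(x, kernel, mode="same")   # first nine-point average
    w = np.convolve(v, kernel, mode="same")   # double average: low-frequency w[n]
    p = x - w                                 # high-frequency signal p[n]
    r = np.abs(p)                             # rectified high-frequency signal r[n]

    feats = []
    for start in range(0, len(x) - frame_len + 1, frame_shift):
        fw, fp, fr = (s[start:start + frame_len] for s in (w, p, r))
        z_p = np.sum(np.signbit(fp[:-1]) != np.signbit(fp[1:]))  # zero crossings of p
        feats.append([fw.mean(),         # frame-based mean of w
                      np.mean(fw ** 2),  # frame-based power of w
                      np.mean(fp ** 2),  # frame-based power of p
                      z_p,               # zero-crossing rate of p
                      fr.mean()])        # frame-based mean of r
    return np.asarray(feats)

def stack_frames(feats, k=15):
    """Concatenate each frame with its k neighbors on either side (2k + 1 = 31 frames)."""
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * k + 1)])

# At 2048 Hz, a 25 ms window is ~51 samples and a 5 ms shift is ~10 samples.
# Concatenating the stacked vectors of all 8 channels yields 5 * 31 * 8 = 1240 features.
```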
To reduce the dimension of the parameter vector, we apply linear discriminant analysis (LDA) [44], as in [25,43]. To select the optimum dimension, we analyzed the effect of the number of features on the frame-based phone classification accuracy. Figure 4 shows the average validation accuracy per number of LDA features for the first session of each speaker. Based on this graph, we chose to use 21 LDA features because the average accuracy reaches a plateau at that value. Choosing a higher number of features would result in a more complex model and a longer training time. The classifier used to search for the optimal LDA dimension was a neural network with a batch size of 128 and 20 epochs.
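A minimal sketch of this reduction step, shown here with scikit-learn (the paper does not state which LDA implementation was used, and the random arrays merely stand in for the stacked TD0 vectors and frame-level labels):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Dummy stand-ins: 5000 frames of 1240 stacked features, 30 frame-level classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 1240))
y_train = rng.integers(0, 30, size=5000)

# With 30 classes, LDA yields at most 29 components; 21 are kept here.
lda = LinearDiscriminantAnalysis(n_components=21)
X_train_lda = lda.fit_transform(X_train, y_train)
print(X_train_lda.shape)  # (5000, 21)
```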
2.4. Experiments
This section describes the experimental part of the study, namely, the classifier used and its configuration, the manner in which we considered speaker and session dependency for the experiments, and how we applied cross-validation.
2.4.1. Classification Method
The classifier used for the experimental part of this study is a feed-forward neural network with one hidden layer, using a batch size of 256 and 100 training epochs. We chose these parameters based on a hyper-parameter search, tracking the validation accuracy over 250 training epochs for three batch sizes: 64, 128, and 256. We repeated this for Session 101 of all six speakers and averaged the results (see Figure 5). We chose 100 training epochs because at that point the performance reaches a plateau, and we chose the largest batch size because there was no difference between the three batch sizes and a larger batch size means a lower training time. The network has an input dimension equal to the number of features (21 nodes) and a dense hidden layer with twice as many nodes as features (42 nodes in total) using a rectified linear unit (ReLU) activation function [45]. The output layer has the same number of nodes as the number of classes (30, comprising 29 phones plus silence) and uses a softmax activation function [46]. Furthermore, a categorical cross-entropy loss function and the Adam optimizer [47] with a learning rate of 0.001 are applied.
2.4.2. Speaker and Session Dependency
Our study involves three separate rounds of experiments, each varying in terms of speaker and session dependency. The first round of experiments was both speaker-dependent and session-dependent, which means that the training and testing data were taken from the same session. In the second round of experiments, the data were speaker-dependent but session-independent: the training data came from a different session or sessions than the testing data, but all sessions were recorded by the same speaker. This method allows for evaluating both the effect of increasing the amount of data from the same speaker on the performance of the model and the impact of inter-session variability on the accuracy. In the third round, we used speaker-independent data by training the model on data from multiple sessions of one speaker and testing it on data from another speaker. The testing session contains a session-specific corpus that was not included in the sessions used to train the model, making the experiment both speaker-independent and session-independent. The goal is to assess the potential of creating a model that can be applied to new speakers without the need for adaptation, by training it only on data already in the database.
2.4.3. Cross-Validation
We used five-fold cross-validation to obtain the validation accuracy. This means that five different classifiers are trained, each time leaving out a different fold that functions as the validation set. The obtained results were then averaged. The testing accuracy was obtained after a new classifier was trained using all the training data and then tested on the unseen testing set.
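A sketch of this scheme using scikit-learn's fold generator; `build_model` is a hypothetical helper returning a freshly initialized copy of the network above, and `X_train_lda` / `y_train_onehot` follow the earlier sketches rather than the authors' actual pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

val_accuracies = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train_lda):
    clf = build_model()  # hypothetical: rebuilds the network defined above
    clf.fit(X_train_lda[train_idx], y_train_onehot[train_idx],
            batch_size=256, epochs=100, verbose=0)
    _, acc = clf.evaluate(X_train_lda[val_idx], y_train_onehot[val_idx], verbose=0)
    val_accuracies.append(acc)
print("mean validation accuracy:", np.mean(val_accuracies))

# Final testing accuracy: train one classifier on all training data,
# then evaluate once on the held-out 20% test split.
```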
4. Discussion and Conclusions
This paper presents the results of phone classification experiments conducted on the new ReSSInt-EMG database. Compared to our previous work [34], we revisited the linear discriminant analysis (LDA) reduction procedure, which resulted in changing the number of LDA features from 28 to 21. The change in the number of features used to train the model helped to reduce the training time and the complexity of the model, while the obtained accuracy remained similar. Furthermore, we included new sessions from the speakers that were already part of the database as well as recordings from two new speakers, and we extended the experiments with different conditions of speaker and session dependency. To accommodate the increased complexity of these experiments, we also used a neural network as the classification method instead of a bagging classifier.
The session-dependent classification results show varying outcomes not only across speakers but also across multiple sessions from the same speaker. Furthermore, the session-independent results indicate a substantial decrease in testing accuracy when the model is applied to data from sessions not included in the training phase, with the magnitude of this effect differing between the two speakers.
The decrease in testing accuracy observed when training with data from a session different from the one used to test the model is likely due to inter-session variability, which can be attributed to several factors. First, despite the use of a 3D mask, variations in electrode placement can occur between sessions. Second, the physical or mental state of the speaker may lead to slight differences in articulation between sessions, as each is recorded on a different day. For instance, a person may articulate differently when feeling exhausted, resulting in less articulation effort. Third, environmental conditions such as temperature and humidity can affect the speaker’s state and the contact between the electrodes and the skin. High temperatures may cause increased sweating and decreased motivation. These factors can impact the recorded EMG signals, resulting in each session being recorded under unique circumstances. Consequently, a model that can identify patterns in the EMG signals of one session may struggle to recognize those same patterns in signals from a different session.
Interestingly, when additional session data are added to the training data, testing accuracy increases. Given a corresponding decrease in validation accuracy, we believe that the improvement is due to enhanced diversity and representation of the data, allowing the model to better generalize beyond the training data. These results suggest that developing an EMG-based SSI with sufficient performance for real-world applications requires a large and diverse database. While using a larger set of training data may potentially slow down the experiments and require additional resources, we firmly believe that it is crucial to leverage as much training data as possible, provided that sufficient processing capabilities are available and the addition of new data leads to improved model performance. Our rationale stems from the fact that an SSI system suitable for real-world applications requires extensive preparation to handle unseen data.
The speaker-independent classification results demonstrate a substantial decrease in model accuracy when trained with data from other speakers, even when the amount of training data is comparable to the speaker-dependent, session-independent models. This suggests that the differences between speakers’ data are substantial, making it challenging for the model to generalize to a different speaker. These differences can be attributed to various factors, such as differences in speakers’ physiognomy, articulation manner, or speaking pace. These findings suggest that using an SSI trained on a different speaker presents extra difficulty. Further experiments are needed to investigate whether training the model with a more extensive database from a single speaker or with data from multiple speakers can enhance speaker-independent performance.
It is important to note that during four sessions, we observed a deviation in the signals of one channel, which cast uncertainty on its quality. The classification accuracy of these sessions is indeed lower compared to the other sessions by the same speaker. The most likely cause of these signal deviations is the detachment of the electrodes of this channel during recording or the use of defective electrodes or cables. EMG data acquisition is a sensitive process, and the recorded signals can vary depending on the speaker and the recording conditions.
Considering all of our findings, we plan to record more data from fewer speakers for future studies to address the issue of inter-session variability. We believe that this strategy will allow us to collect a more diverse range of data and enhance the performance of the EMG-based SSI. Furthermore, we intend to undertake more complex tasks, such as direct speech generation from EMG signals, to achieve our ultimate goal of developing an EMG-based SSI for Spanish-speaking laryngectomees.