Evaluation of Speech Quality Through Recognition and Classification of Phonemes

Abstract: This paper discusses an approach for assessing the quality of speech during speech rehabilitation. One of the main reasons for the decrease in speech quality after the surgical treatment of vocal tract diseases is the loss of parts of the vocal tract and the disruption of its symmetry. In particular, one of the most common oncological diseases of the oral cavity is cancer of the tongue. During surgical treatment, a glossectomy is performed, which creates the need for speech rehabilitation to eliminate the resulting speech defects, which reduce speech intelligibility. In this paper, we present an automated approach to speech quality evaluation. The approach relies on a convolutional neural network (CNN). The main idea is to train an individual neural network for a patient before the operation so that it recognizes the typical sound of phonemes in that patient’s speech. The neural network is thereby able to evaluate the similarity between the patient’s speech before and after the surgery. Recognition based on the full phoneme set and recognition by groups of phonemes were both considered. The assessments obtained through the automatic recognition approach are shown to correspond with those obtained from human experts. The automated approach is, in principle, applicable to defining boundaries between phonemes. The paper shows that iterative training of the neural network and continuous updating of the training dataset gradually improve the ability of the CNN to define boundaries between different phonemes.


Introduction
One of the most common types of tumors of the speech-forming tract organs is cancer of the tongue [1]. Surgical treatment, consisting of a glossectomy [2], leads to the loss of part of the tongue involved in the formation of a number of phonemes. In particular, even during reconstructive surgery, a disruption of the tongue's symmetry occurs. Ultimately, this leads to a disruption of the pronunciation, in particular, of pre-lingual consonants, which leads to a decrease in syllabic intelligibility and speech intelligibility in general.
In previous papers [3,4], we examined the application of automation in speech rehabilitation therapy for people who undergo a surgical intervention on speech organs. Currently, speech quality during rehabilitation is evaluated by several experts (speech therapists) in order to obtain more objective results. This procedure is time-consuming for the experts and requires the patient to come to a hospital, which is not always convenient or possible. These problems motivated the idea of automating the rehabilitation process.
A speaker-dependent neural network is trained for each patient on audio records pronounced by a patient before a surgery. The neural network learns how a patient pronounces each type of phoneme, thus aiming to recognize whether the patient pronounces phonemes in the same manner after the surgery. Speech recorded by a patient is a set of syllables which includes the most problematic phonemes, such as /k/, /s/, /t/, etc. (hereinafter referred to as problematic phonemes). The list of these phonemes was compiled at the first stage of the study [4]. We could potentially use the complete classical table of syllables from GOST R 50840-95 [5] (5 tables, 250 syllables according to the method of evaluating syllable intelligibility). However, recording 250 syllables per session is a tedious task for a patient. Therefore, in agreement with physicians engaged in speech rehabilitation, it was decided to limit the complete list to 90 syllables. The sample is oriented towards the main problematic phonemes (/k/, /s/, /t/ and their soft implementations). The most problematic phoneme /r/ was excluded from consideration, because the mechanism of producing this phoneme changes fundamentally after the operation. Consequently, its comparison with the standard is meaningless. The audio records obtained before a surgery were considered as a benchmark. Subsequently, the quality of the speech after the surgery was evaluated in comparison with this benchmark.
In order to train the neural network to recognize phonemes in after-surgery records, it was necessary to know the start and the end point of each phoneme in before-surgery records. The phonemic composition of all syllables was known. However, the start and the end points of each phoneme were not defined. Consequently, this raised the question of phoneme alignment.
Initially, audio records were segmented at a phoneme level manually. The start and the end points of each phoneme in a syllable were defined acoustically and visually using a software tool that displayed spectrograms. For manual segmentation into phonemes and for viewing spectrograms, Praat [6] and Wavesurfer [7] were used. This approach was quite time-consuming. As a result, it was necessary to move on to automatic segmentation.
The aim of this work was a reduction in the time taken to align phonemes manually in before-surgery audio records.

Methods of Syllable Recognition
The task of recognizing syllables and phonemes is part of the more general task of speech recognition. The first approach to solving it is hierarchical analysis (from the level of phonemes and syllables to the level of sentences and their groups), for which the following methods can be used:
• hidden Markov models [8,9];
• hidden Markov models and Gaussian mixture models [10,11];
• deep learning neural networks [12,13];
• different hybrid models [14–16].
The second approach is end-to-end speech recognition. It differs from sequential hierarchical analysis in that it allows you to analyze the original signal and move to higher levels of analysis (for example, the level of words), bypassing lower levels [17,18].
In this paper, the recognition of phonemes within isolated syllables without context is of interest. This makes adjustments impossible at the upper levels of analysis. A statistical imbalance in the phonetic material increases the share of "problem" phonemes and contributes to the impossibility of using ready-made speech models. This makes it impossible to use ready-made solutions in the field of speech recognition, such as the Hidden Markov Model Toolkit (HTK) [19], Kaldi [20,21], Sphinx [22], and their local-language analogues.
In assessing the quality of syllable pronunciation, the following options may be identified:
• the direct recognition of pronounced phonemes within a syllable. The main disadvantage of this approach is the large number of classes (equal to the number of phonemes), which results in a high level of error. On the other hand, the positive aspects include the straightforward result in the form of a phoneme sequence and the ease of error determination;
• the recognition of phonemes within a syllable as instances of classes formed of phonetic groups. The positive side is that the accuracy increases due to the reduced number of classes. However, it leads to a lack of direct interpretation, hence a lack of direct evaluation of the assessment according to GOST R 50840-95;
• the identification of boundaries between phonetic segments using a neural network for the follow-up application of parametric approaches in comparing these segments. In this case, the accuracy of determining transitions between phonemes is more important than the classification accuracy. A measure of difference between the selected segments is determined on the basis of previously developed parametric methods [3]. The disadvantage of this option is the unavailability of a quantitative assessment such as the classic syllable intelligibility (as defined in Section 2.1) in the output of the system.

Direct Recognition of Phonemes
In the context of recognition, we used an approach for searching audio fragments similar to those in the training dataset.
The approach implemented within the framework of this task had several limitations, some of which were introduced artificially.

1. Dependence on a speaker. A model for assessing the quality of speech was built for each new speaker. The task was not to improve the quality of speech relative to the already established manner of pronouncing phonemes or to existing speech defects. The task in the rehabilitation process was to maximize the conformity of a patient's after-surgery speech with their speech before the operative treatment. This limitation significantly simplifies the task because there is no need to use a large database of records from many speakers for training.

2. Limited number of phonemes. We were primarily interested in the quality of pronouncing the phonemes that were most susceptible to change after the operation. For this reason, the table of syllables focuses specifically on those problematic phonemes.

3. The assessment of one syllable, pronounced by a patient while practicing at the rehabilitation stage, should be as fast as possible. Currently, the evaluation takes 3 seconds per syllable. Training the convolutional neural network (CNN) takes less than one hour. The Adam optimizer with a mini-batch size of 128 was used for training the neural network [23]. However, the training time did not matter much since the period of time between a before-surgery session and the first rehabilitation session was approximately one week.

4. Within the framework of the paper, the term "syllable intelligibility" refers to the proportion of correctly recognized syllables among all syllables pronounced by a patient in accordance with a predefined set of syllables. In the future, values of the output layer of the neural network will be used to assess the degree of similarity between a pronounced phoneme and the correct one in order to implement the biofeedback mechanism in the rehabilitation process. However, in this paper, the idea was to prove the applicability of this automated speech quality assessment approach to speech rehabilitation.

5. It was known in advance which syllable was pronounced. There was no need to interpret the sequence of recognized phonemes, transforming it into a syllable; it was only necessary to estimate the proportion of correctly pronounced phonemes in that sequence.
To implement a deep neural network for the recognition of syllables in the framework of assessing the quality of their pronunciation, the computing environment MATLAB 2018a (MathWorks, Natick, MA, USA) [24] was used. The outputs had the following structure: a vocalization output, a softness output, and 21 classes for phoneme identification, for a total of 23 outputs. The input layer contained 4800 neurons.
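As a rough illustration of the output structure described above, the following Python sketch splits one 23-element output vector into a vocalization flag, a softness flag, and an arg max over the 21 phoneme classes. The ordering of the outputs, the 0.5 threshold, and all names are assumptions made for illustration; the source does not specify them, and the original system was built in MATLAB.

```python
from typing import NamedTuple, Sequence

N_PHONEME_CLASSES = 21              # from the text: 21 phoneme classes
N_OUTPUTS = 2 + N_PHONEME_CLASSES   # vocalization + softness + classes = 23


class FrameDecision(NamedTuple):
    voiced: bool
    soft: bool
    phoneme_class: int


def decode_outputs(outputs: Sequence[float], threshold: float = 0.5) -> FrameDecision:
    """Split one 23-element output vector into its three parts.

    Assumed (hypothetical) layout: index 0 = vocalization, index 1 = softness,
    indices 2..22 = scores of the 21 phoneme classes.
    """
    if len(outputs) != N_OUTPUTS:
        raise ValueError(f"expected {N_OUTPUTS} outputs, got {len(outputs)}")
    class_scores = outputs[2:]
    # Arg max over the phoneme-class scores.
    best = max(range(N_PHONEME_CLASSES), key=lambda i: class_scores[i])
    return FrameDecision(outputs[0] > threshold, outputs[1] > threshold, best)
```

Keeping the two binary attributes separate from the class index mirrors the description in the text, where vocalization and softness are reported as their own outputs rather than being folded into the phoneme classes.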

Algorithm of Automatic Time Alignment at Phoneme Level
First of all, the speaker-independent neural network was trained on a Russian language audio dataset containing 7 sentences recorded by 10 speakers of both genders. Every audio file was complemented by a transcript describing its phonemic composition and time periods.
The algorithm to align an unsegmented audio file included the following steps:
1. Training a neural network (initially on data from an audio-aligned corpus).
2. Recognition of a phonemic composition as phonetic sequences for each syllable.
3. Determining time-aligned phonemic transcriptions of syllables.
4. Adjustment of recognized phonemic compositions and selection of syllables with a correctly defined phonemic composition.
5. Forming additional data from transcriptions of correctly recognized syllables.
6. Going to step 1 and retraining the neural network on an updated dataset (including new data formed at step 5).
The steps were repeated until the required level of quality was reached.
The main sequence of actions in IDEF0 notation [25] is shown in Figure 1.
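The six steps above can be sketched as a simple self-training loop. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the callables `train`, `recognize`, and `adjust` are hypothetical stand-ins for the network training, recognition plus time alignment, and adjustment stages, and the stopping rule (`min_correct` fully correct syllables or `max_iterations` passes) is one possible reading of "the required level of quality".

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, int, int]   # (phoneme label, start_ms, end_ms)
Transcript = List[Segment]
Dataset = List[Tuple[str, Transcript]]


def iterative_alignment(
    train: Callable[[Dataset], object],                        # hypothetical
    recognize: Callable[[object, str], Transcript],            # hypothetical
    adjust: Callable[[Transcript, List[str]], Transcript],     # hypothetical
    corpus: Dataset,
    syllables: Dict[str, List[str]],
    min_correct: int,
    max_iterations: int = 10,
) -> Dict[str, Transcript]:
    """Return time-aligned transcriptions of the syllables recognized correctly."""
    aligned: Dict[str, Transcript] = {}
    dataset = list(corpus)
    for _ in range(max_iterations):
        model = train(dataset)                                 # step 1
        for audio_id, expected in syllables.items():
            if audio_id in aligned:
                continue
            transcript = recognize(model, audio_id)            # steps 2 and 3
            transcript = adjust(transcript, expected)          # step 4
            if [label for label, _, _ in transcript] == expected:
                aligned[audio_id] = transcript
                dataset.append((audio_id, transcript))         # step 5
        if len(aligned) >= min_correct:                        # quality reached
            break
    return aligned                                             # step 6 = next pass
```

Only syllables whose phonemic composition matches the known one are fed back into the training data, which is what makes the network gradually more speaker-dependent.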

Phoneme Recognition and Time Alignment
At the data preprocessing stage, WAV format audio files at a sample rate of 16 kHz were sliced into overlapping 20 ms frames with a frame step of 1 ms. Mel-frequency cepstral coefficients (MFCCs) were extracted for each frame.
One audio file of a syllable record lasts about 1000 ms (1 s). Thus, we obtained 981 frames of 20 ms length each. The arg max rule was used to compute the label for each time step, that is, each time step (frame) was labelled independently. Figure 2 depicts the best path decoding example [26,27] for a 1 s audio file. As a result, we obtained a sequence of labels. Many labels were repeated because a phoneme spanned many frames and the frames overlapped. The algorithm for removing the duplicates and calculating each phoneme's start and end points is considered further.
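The frame arithmetic and the arg max labelling can be checked with a short Python sketch. The function names are hypothetical, and MFCC extraction itself is left out; the sketch only shows how a 1000 ms record yields 981 frames and how a per-frame label is chosen.

```python
from typing import List, Sequence


def frame_count(duration_ms: int, frame_ms: int = 20, step_ms: int = 1) -> int:
    """Number of full frames that fit into a recording of the given length."""
    if duration_ms < frame_ms:
        return 0
    return (duration_ms - frame_ms) // step_ms + 1


def frame_bounds(index: int, sample_rate: int = 16_000,
                 frame_ms: int = 20, step_ms: int = 1) -> range:
    """Sample indices covered by frame number `index` (0-based)."""
    start = index * step_ms * sample_rate // 1000
    length = frame_ms * sample_rate // 1000
    return range(start, start + length)


def best_path(scores: Sequence[Sequence[float]], labels: Sequence[str]) -> List[str]:
    """Arg max rule: pick the highest-scoring label independently for each frame."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]
```

With the default parameters, `frame_count(1000)` gives 981, and the last frame, `frame_bounds(980)`, covers samples 15680 to 15999, so it ends exactly at the end of a 1 s record sampled at 16 kHz.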
We defined two parameters for this algorithm:
• The minimum length of a phoneme (the minimum number of consecutive frames labeled as the same phoneme) as min_seq_len.
• The maximum length of deviations from a consecutive sequence of the same phoneme labels as max_dev_len.
These two parameters might vary depending on the type of phoneme. For example, phonemes /g/ and /d/ usually lasted less than phoneme /t/ or /s/.
Considering a simplified sequence shown in Figure 3, which contains only 30 labels, the result of applying the algorithm described further (below Figure 3) is a time-aligned phonemic transcription as presented in Table 1.
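One way the two parameters could be used to collapse a per-frame label sequence into phoneme segments is sketched below in Python. This is an illustrative interpretation rather than the authors' algorithm: a segment is opened only by a run of at least `min_seq_len` identical labels, and a later run with the same label is merged into it if at most `max_dev_len` deviating frames lie in between. The conversion to milliseconds assumes that each frame stands for one frame step.

```python
from itertools import groupby
from typing import List, Tuple

Segment = Tuple[str, int, int]   # (label, first frame, last frame), inclusive


def collapse_labels(labels: List[str], min_seq_len: int, max_dev_len: int) -> List[Segment]:
    """Remove duplicate labels and short deviations, keeping phoneme segments."""
    # Run-length encode the per-frame labels.
    runs: List[Segment] = []
    pos = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, pos, pos + length - 1))
        pos += length

    segments: List[Segment] = []
    for label, start, end in runs:
        if segments:
            prev_label, prev_start, prev_end = segments[-1]
            gap = start - prev_end - 1        # deviating frames in between
            if label == prev_label and gap <= max_dev_len:
                segments[-1] = (prev_label, prev_start, end)
                continue
        if end - start + 1 >= min_seq_len:
            segments.append((label, start, end))
        # Otherwise the run is a short deviation and is dropped.
    return segments


def to_milliseconds(segment: Segment, step_ms: int = 1) -> Tuple[str, int, int]:
    """Assumption: frame i represents the interval [i * step_ms, (i + 1) * step_ms)."""
    label, first, last = segment
    return label, first * step_ms, (last + 1) * step_ms
```

Per-phoneme values of `min_seq_len` and `max_dev_len` could be supplied by calling the function with different settings for short phonemes such as /g/ and /d/ and for longer ones such as /t/ and /s/.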

Adjustment of Recognized Phonemic Composition
Since time-aligned phonemic transcriptions were defined for all the syllables, it was possible to apply some operations to adjust partly incorrect phonemic compositions of some syllables. We proposed three operations for the adjustment. These operations are simple and intuitive and are applied as shown in Figure 4. If the phonemic composition of a syllable became completely correct after the adjustment, the time-aligned phonemic transcription of this syllable was used to create new data for further neural network training.

The Direct Recognition of Problem Syllables
At this stage, a phoneme was considered to be correctly recognized if more than 50% of the correct samples were present. Results of the assessment of syllable intelligibility by five experts and using the proposed approach are presented in Table 2. The recognition of the whole syllable with problematic phonemes was considered. The estimates, with their standard deviations, given by five different experts and by the individual neural networks are presented for every person/patient. The group of patients and the group of healthy speakers included three people each. The healthy speakers spoke with and without the use of their tongues in order to imitate the pronunciation of patients before and after surgery, respectively. The patients made their audio records before and after undergoing the surgery. "Person No." denotes healthy speakers and "Patient No." denotes patients who began the rehabilitation. "Normal" is standard speech for healthy speakers. "Before-surgery" is standard speech before the operation for patients. "Without tongue" is speech without the use of the tongue for healthy speakers. "After-surgery" is speech after the operation for patients. Records contain syllables with problematic phonemes (/t/, /k/, /s/, /t'/, /k'/, /s'/ [3]). The list of audio records contains 90 syllables.
Table 3 contains the same information, but only problematic phonemes were used for the calculation of the scores instead of the whole phonemic composition of syllables. This nuance is not substantial for experts, but has an important influence on the neural network. The main reason is the larger number of problematic phonemes in the training dataset in comparison with other phonemes. As a result, the neural network makes far fewer errors on problematic phonemes.
Table 3. The results of the assessment of syllable intelligibility by experts and using the proposed approach based on CNN for healthy speakers with and without the use of a tongue and for patients before and after surgery. The recognition of only problematic phonemes from syllables is considered.
After considering Tables 2 and 3, the following conclusions were drawn.

1. In Table 2, even for a healthy speaker, the syllable intelligibility calculated by the CNN did not reach 100%, thus diverging from the opinion of the experts. However, mistakes mostly arose from "non-problematic" phonemes, which is explained by their small share in the syllable set table. In particular, some of the phoneme implementations in the table are missing, since recognition was not the ultimate goal of the system.

2. On the other hand, for problematic phonemes, the results of which are presented in Table 3, the difference is statistically insignificant according to Student's t-test at the 0.95 confidence level. In the future, it is possible to improve this result by varying the structure of the neural network used and adapting it to the problem being solved.

3. The qualitative assessment of syllable intelligibility given by the CNN corresponds to that of the experts. This supports the applicability of the proposed approach to the problem of speech quality assessment during speech rehabilitation. It also confirms the consistency at the level of ranking positions between the classical expert method for estimating syllable intelligibility and the proposed method using neural networks.
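The scoring rule used for Tables 2 and 3 can be illustrated with a small Python sketch. It assumes that each phoneme of a syllable is represented by the frame labels that fall inside its segment together with the expected label; this data layout and the function names are hypothetical. Passing all phonemes of each syllable corresponds to the Table 2 setting, and passing only the problematic phonemes corresponds to the Table 3 setting.

```python
from typing import List, Tuple

# One phoneme: (frame labels inside its segment, expected label).
PhonemeResult = Tuple[List[str], str]


def phoneme_recognized(frame_labels: List[str], expected: str) -> bool:
    """A phoneme counts as recognized if more than 50% of its frames are correct."""
    if not frame_labels:
        return False
    return frame_labels.count(expected) / len(frame_labels) > 0.5


def syllable_intelligibility(syllables: List[List[PhonemeResult]]) -> float:
    """Proportion of syllables in which every scored phoneme was recognized."""
    if not syllables:
        return 0.0
    correct = sum(
        all(phoneme_recognized(frames, expected) for frames, expected in syllable)
        for syllable in syllables
    )
    return correct / len(syllables)
```

Note that exactly 50% correct frames is not enough under the "more than 50%" rule, so a phoneme with two correct frames out of four is counted as an error.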

The Speaker-Dependent Neural Network for Class Segmentation
Retraining the neural network on an updated dataset makes it more and more speaker-dependent as well as more precise for a certain speaker-patient. Figure 5 depicts the graphical representation of gradual changes in a phonemic composition of syllable [g'1s] recognized by the neural network.
Symmetry 2019, 11, x; doi: FOR PEER REVIEW; www.mdpi.com/journal/symmetry
Figure 5. Example of an output label sequence at each time step. "Iteration 0" represents the output of the speaker-independent neural network. Starting from the first iteration, the neural network becomes speaker-dependent. "Iteration 3" and "Iteration 6" allow us to see the progress of retraining.
Among the characteristic features, the decrease in phonemes belonging to the class [d + d'] in the classification results can be distinguished. The count of windows with this class decreases with an increase in the number of iterations. This fact can be explained by the relatively small representation of the given class in the general sample. As a result, with additional training, fewer values fall into the training set, which leads to a decrease in the detection of this class. If we compare this result with the set of sounds [g + g'] and [k + k'] when solving the ultimate problem of setting boundaries with the known phonetic composition of the syllable and an a priori representation of the class, the following dependence is visible. The significance of the class that is absent in the identified syllable decreases with an increase in the number of iterations.
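The observation about window counts per class can be reproduced with a few lines of Python. The sketch merges hard and soft variants of a consonant into one class, such as [d + d'], and counts the windows assigned to each class; the apostrophe convention for soft phonemes and the default set of paired consonants are assumptions made for illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable

# Hypothetical set of consonants that have a soft counterpart.
PAIRED = frozenset({"d", "g", "k", "s", "t"})


def to_class(label: str, paired: FrozenSet[str] = PAIRED) -> str:
    """Map a phoneme label to its merged class, e.g. "d'" -> "[d + d']"."""
    base = label.rstrip("'")
    return f"[{base} + {base}']" if base in paired else label


def class_counts(labels: Iterable[str]) -> Counter:
    """Number of windows (frames) assigned to each merged class."""
    return Counter(to_class(label) for label in labels)
```

Calling `class_counts` on the label sequence produced at each retraining iteration gives one counter per iteration, which makes a shrinking class such as [d + d'] easy to track.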
The experiment of syllable segmentation at a phoneme level was conducted with the following characteristics:
• 1 initial speaker-independent neural network trained on 7 sentences recorded by 10 speakers;
• 6 Russian-speaking patients (3 female and 3 male voices);
• […] (Table 4).
Table 5 presents the results obtained for each patient. The neural network was retrained as many times as needed to recognize an accurate phonemic composition for at least 12 out of 15 syllables.

Table 5. Summary of applying the algorithm of automatic syllable segmentation at a phoneme level. Columns: Patient No.; Gender; Fully Correct Recognized Syllables; Problematic Syllables.

After considering the results in Table 5, the following conclusions were drawn.
• The proposed algorithm gives the expected result and may be applied for automatic identification of a phonemic composition of syllables as well as for determining the start and the end time points for each phoneme.
• […] turned out to be the most problematic syllables. This may be due to two consecutive voiceless phonemes, such as /k/, /t/ for [d'okt] or /s/, /t/ for [st'1t].
• The efficiency of the algorithm may be improved by fine-tuning the min_seq_len and max_dev_len parameters based on the specificities of phonemes.

Conclusions
In this paper, we proposed an approach to speech quality assessment based on a convolutional neural network. In the context of this work, speech recognition was applied to estimate syllable intelligibility according to the method presented in GOST R 50840-95 "Speech transmission over various communication channels". Methods for assessing the quality of the recognition were considered. Within this approach, the final deep neural network can act as an auditor and issue a quantitative estimate at its output. The values obtained show no obvious contradictions between the results from the CNN and the estimates provided by experts. In addition, for objective assessments made by humans, it was necessary to have the opinions