GMM-Based Evaluation of Synthetic Speech Quality Using 2D Classification in Pleasure-Arousal Scale

Abstract: The paper focuses on the description of a system for the automatic evaluation of synthetic speech quality based on the Gaussian mixture model (GMM) classifier. The speech material originating from a real speaker is compared with synthesized material to determine similarities or differences between them. The final evaluation order is determined by distances in the Pleasure-Arousal (P-A) space between the original and synthetic speech using different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMM models for continual 2D detection of P-A classes are trained using the sound/speech material from databases without any relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show a substantial influence of the number of mixtures, the number and type of the speech features used, the size of the processed speech material, and the type of the database used for the creation of the GMMs on the P-A classification process and on the final evaluation result. The main evaluation experiments confirm the functionality of the system developed. The objective evaluation results obtained are principally correlated with the subjective ratings of human evaluators; however, partial differences were indicated, so a subsequent detailed investigation must be performed.


Introduction
At present, many different subjective and objective methods and criteria for quality evaluation of synthetic speech produced by text-to-speech (TTS) systems are used. For the subjective assessment of synthesis quality, listening tests are generally acknowledged. The conventional listening tests usually involve a comparison category rating on a scale from "much better" to "much worse" than high-quality reference speech [1]. Perceptual characteristics may be divided into five basic dimensions: (1) naturalness of voice and its pleasantness, (2) prosodic quality including accentuation, rhythm, and intonation, (3) fluency and intelligibility, (4) absence of disturbances, and (5) calmness, with the first three being the best for capturing the integral quality [2]. Apart from the naturalness and understandability of contents, listening tests can also measure the distinguishability of characters or the degree of entertainment [3]. The subjective scales for rating the synthesized speech may include only a few scored parameters, such as an overall impression by a mean opinion score (MOS) describing the perceived speech quality from poor to excellent, a valence from negative to positive, and an arousal from unexcited to excited [4]. The MOS scale can be used not only for naturalness, but also for different dimensions, such as affect (from negative to positive) or speaking style (from irritated to calm) [5]. The comparison of a pair of utterances synthesized by different methods or originating from different speech inventories is often carried out by a preference listening test [6]. For objective speech quality estimation of the TTS voice, various speech features extracted from the natural and […] of acoustic noise. Thus, the automatic evaluation system developed must satisfy the requirements for the comparison of speech synthesis approaches with essentially different acoustic realizations. In this experiment, the objective results were compared with the subjective ones based on the subjective assessment called MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test [19] for the comparison of speech stimuli using hidden original speech, as well as anchors with different impairments.
An auxiliary analysis was carried out to reveal a possible influence of the number of mixture components, the number of synthetic sentences tested, the types of speech features, the types of audio databases for GMM creation, and the dispersion of positions of original utterances in the P-A space on the partial results of the continual GMM P-A classification, as well as on the stability and the accuracy of the final evaluation results. In addition, the influence of the number of mixtures used for GMM creation and training together with 2D classification in the P-A space on the computational complexity (CPU processing time) was investigated. The experiments realized confirm the suitability of the method for this type of task as well as the principal functionality of the system developed.

Emotion Evaluation and Distribution in the Pleasure-Arousal Space
Acoustic stimuli, such as noise, speech, or music, induce specific emotional states in listeners. These emotions may be classified from a discrete or a dimensional perspective [20]. In the discrete model, six basic emotions are usually recognized: joy, sadness, surprise, fear, anger, and disgust [21]. The dimensional model represents all possible emotions on a two-dimensional or three-dimensional scale. The first dimension is Pleasure, ranging from negative to positive feelings; the second dimension is Arousal, referring to alertness and activity with the range from calm to excited states; and the third dimension is Dominance, describing emotional states from being controlled to controlling [22]. For the discrete emotions mapped in the space of the first two dimensions, the negative emotions of anger and sadness correspond to low Pleasure, positive emotions such as surprise and joy have high Pleasure, passive apathetic emotions are characterized by the lowest Arousal, and frantic excitement corresponds to the highest Arousal [23].
Using these first two dimensions, the 2D diagram in a Pleasure-Arousal (P-A) space [24] is divided into four emotion quadrants (EQ1-EQ4) that can be categorized as follows: EQ1 = pleasant with high intensity of feeling; EQ2 = unpleasant with high intensity; EQ3 = unpleasant with low intensity; EQ4 = pleasant with low intensity. In relation to pleasantness and feeling intensity, the basic importance weights for each of the emotion quadrants were defined as documented in Figure 1. This approach is used in further analysis for the determination of the final evaluation decision.
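The quadrant definition above can be illustrated with a short sketch. It assumes that Pleasure is plotted on the horizontal axis and Arousal on the vertical axis, that coordinates are given relative to the chosen origin, and that a point lying exactly on a border is assigned to the quadrant that starts at that angle; the function name is illustrative and none of these choices is taken from the paper.

```python
import math


def emotion_quadrant(pleasure: float, arousal: float) -> str:
    """Map a point (relative to the origin of the P-A plane) to EQ1..EQ4.

    Assumed convention: Pleasure is the x axis, Arousal is the y axis,
    and angles are measured counter-clockwise from the positive Pleasure axis.
    """
    angle = math.degrees(math.atan2(arousal, pleasure)) % 360.0
    if angle < 90.0:
        return "EQ1"  # pleasant, high intensity
    elif angle < 180.0:
        return "EQ2"  # unpleasant, high intensity
    elif angle < 270.0:
        return "EQ3"  # unpleasant, low intensity
    else:
        return "EQ4"  # pleasant, low intensity
```

For example, a point with negative Pleasure and positive Arousal falls into EQ2.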

Creation of Gaussian Mixture Models for Pleasure-Arousal Classes
The proposed evaluation method is based on the determination and statistical analysis of distances between originals (from a speaker) and the tested synthetic speech in the P-A space with the help of the GMM classifier. The data investigated are approximated by a linear combination of Gaussian probability density functions [25]. They are used to calculate the covariance matrix as well as the vectors of means and weights. Next, the clustering operation is performed to organize objects into groups whose members are similar in some way. Two basic algorithms may be used in this clustering process: (i) k-means clustering, which divides the objects into k clusters so that some metric relative to the centroids of the clusters is minimized, and (ii) spectral clustering, which treats data points as nodes of a connected graph and partitions this graph into sub-graph clusters based on their spectral decomposition [26].
In practice, the k-means algorithm determining the centers is usually used for initialization of the GMM parameters; this procedure is repeated several times until a minimum deviation of the input data sorted in k clusters S = {S1, S2, …, Sk} is found. Subsequently, the iterative expectation-maximization algorithm is used to determine the maximum likelihood of the GMM. The number of mixtures (N_MIX) and the number of iterations (N_ITER) influence the execution of the training algorithm, mainly the duration of this process and the accuracy of the output GMMs obtained.
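The two-stage training (k-means initialization followed by expectation-maximization) can be sketched as follows. This is a minimal illustration on one-dimensional data, not the Netlab-based implementation used in the system; the function names, the variance floor, and the way the initial centers are spread over the sorted data are assumptions made for the example.

```python
import math


def kmeans_1d(data, k, n_iter=20):
    """Plain k-means on scalars; initial centers are spread over the sorted data."""
    s = sorted(data)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def gauss(x, mean, var):
    """Value of the normal probability density function."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_mix=2, n_iter=50, var_floor=1e-6):
    """Fit a 1-D GMM: k-means gives the initial means, EM refines all parameters."""
    means = kmeans_1d(data, n_mix)
    grand_mean = sum(data) / len(data)
    start_var = max(var_floor, sum((x - grand_mean) ** 2 for x in data) / len(data))
    variances = [start_var] * n_mix
    weights = [1.0 / n_mix] * n_mix
    for _ in range(n_iter):
        # E-step: responsibility of every mixture component for every sample
        resp = []
        for x in data:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(n_mix):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor,
                               sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj)
    return weights, means, variances
```

Both `n_mix` and `n_iter` directly scale the amount of work in the loops, which is consistent with the influence on training time noted above.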
The preparation as well as evaluation phases begin with the analysis of the input sentences yielding various speech/sound properties. Four types of signal features are determined in the proposed system: time duration, prosodic, basic spectral, and supplementary spectral parameters. The analyzed signal is processed in overlapping segments. The determined pitch (F0) contour can be divided into N voiced parts and N + 1 unvoiced parts of various durations to obtain different types of time duration (TDUR) features [27]. Apart from the TDUR features, the contours of F0 and signal energy are used to determine standard prosodic (PROS) parameters. Other types of signal features are spectral features (SPEC1), computed using the spectral and cepstral analysis of each input frame, and spectral high-level statistical parameters (SPEC2). The representative statistical values (median, range, standard deviation (std), relative maximum and minimum, etc.) of these features compose the input vector of N_FEAT features for GMM processing. The speech and non-speech sounds are used for the creation and training of the output GMM models specified by the number of Pleasure classes N_PC and Arousal classes N_AC (see the block diagram in Figure 2).
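As an illustration of how representative statistical values can be derived from a feature contour, the sketch below computes the median, range, and standard deviation of a sequence such as an F0 contour. The text does not define "relative maximum and minimum"; here they are assumed to be the extremes divided by the mean, and the function name and dictionary keys are hypothetical.

```python
import statistics


def contour_statistics(values):
    """Summarize a feature contour (e.g., F0 in Hz) by a few statistical values.

    Assumption: relative maximum/minimum are the extremes divided by the mean,
    so the contour is expected to have a non-zero mean.
    """
    v = list(values)
    mean = statistics.fmean(v)
    return {
        "median": statistics.median(v),
        "range": max(v) - min(v),
        "std": statistics.pstdev(v),  # population standard deviation
        "rel_max": max(v) / mean,
        "rel_min": min(v) / mean,
    }
```

Concatenating such values over several contours would give a fixed-length vector regardless of the sentence duration.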
During the classification process, the input vectors from the analyzed sentence are passed to the GMM classifier block to obtain the scores (T, m) that are subsequently quantized to discrete levels corresponding to N_PC/N_AC output P-A classes. This approach is carried out for each of M frames of the analyzed sentence to obtain output vectors of winner P-A classes (see the block diagram in Figure 3).
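A frame-by-frame classification of this kind can be sketched with diagonal-covariance GMMs as follows. The sketch assumes that one GMM is available per class and that the winner is the class with the highest log-likelihood score; how the scores are quantized in the actual system is not detailed in the text, and all names and the data layout are illustrative.

```python
import math


def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def winner_class(x, class_models):
    """Return the label of the class model with the highest score for x."""
    scores = {label: diag_gmm_loglik(x, *model)
              for label, model in class_models.items()}
    return max(scores, key=scores.get)


def classify_frames(frames, class_models):
    """Apply the classifier to each frame and collect the winner classes."""
    return [winner_class(frame, class_models) for frame in frames]
```

Here `class_models` maps a class label to a `(weights, means, variances)` tuple; running the function once with Pleasure models and once with Arousal models would yield the two output vectors.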

Description of the Proposed Automatic Evaluation System
The functional structure of the proposed automatic system can be divided into the preparation part and the main evaluation part. Within the preparation part, the following two operations are performed:
(1) Creation and training of GMM models [25] of N_PC Pleasure classes and N_AC Arousal classes using the material from the speech and sound databases.
(2) Use of these GMM models in the preliminary classification process to determine the individual coordinates [Pco(k), Aco(k)] of the original sentences in the P-A space and the resulting 2D center position [C_PO, C_AO], where 1 ≤ k ≤ K and K is the total number of the processed original sentences.
The main evaluation part consists of the GMM classification operations applied to the synthetic speech sentences produced by different synthesis methods Synt1, Synt2, Synt3, etc. Output values representing their actual position in the P-A space are subsequently processed to obtain the final evaluation order (FEO) decision as shown in the block diagram in Figure 4. The whole evaluation process can be described by the following five operations: […] where N is the total number of the processed synthetic sentences. The FV vector can also be expressed in polar coordinates by its magnitude (M_FV) and angle (φ_FV) in degrees. The FV obtained is subsequently localized inside the four emotional quadrants EQ1-EQ4 around the center of originals (see Figure 5c) with a corresponding emotional meaning in relation to the 2D emotional space (compare with the diagram in Figure 1).
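The geometry of this step can be sketched under explicit assumptions, since the corresponding equations are not reproduced in this text. The center of originals is assumed to be the arithmetic mean of the individual coordinates, and `mean_offset` is a hypothetical stand-in for the FV computation (the mean displacement of the synthetic sentences from that center). Only the polar conversion is definitional.

```python
import math


def center_of_originals(coords):
    """Assumed center: arithmetic mean of the [Pco(k), Aco(k)] pairs."""
    k = len(coords)
    return (sum(p for p, _ in coords) / k, sum(a for _, a in coords) / k)


def mean_offset(synth_coords, center):
    """Hypothetical stand-in for FV: mean displacement from the center."""
    cp, ca = center
    n = len(synth_coords)
    return (sum(p - cp for p, _ in synth_coords) / n,
            sum(a - ca for _, a in synth_coords) / n)


def to_polar(vec):
    """Magnitude and angle (degrees in [0, 360)) of a 2D vector."""
    p, a = vec
    return math.hypot(p, a), math.degrees(math.atan2(a, p)) % 360.0
```

With the angle expressed in the range from 0 to 360 degrees, the vector can be assigned to one of the quadrants EQ1-EQ4 around the center of originals.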
(4) Determination of the summary distribution parameters (SDP) from the FV magnitude and angle for all N_TST tested synthesis types, where IW_EQ1-4 are the importance weight functions depending on the quadrants EQ1-4 determined from the FV angle values (see Figure 1). In all quadrants, the transformation functions IW_EQ1-4 are defined by the weights corresponding to the angles of the quadrant center and of the quadrant borders. The complete transformation functions IW_EQ1-4 are calculated using linear interpolation in angle steps of one degree.
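The construction of a weight function from a few definition points by linear interpolation in one-degree steps can be sketched as follows. The control points below are placeholders chosen only for the example; they are not the values of Table 4, and the function name is illustrative. The SDP formula itself is not reproduced here.

```python
# Placeholder (angle in degrees, weight) pairs at quadrant borders and centers.
# These numbers are NOT taken from the paper; they only illustrate the shape.
PLACEHOLDER_POINTS = [
    (0, 0.75), (45, 1.0), (90, 0.0), (135, -1.0), (180, -0.75),
    (225, -0.5), (270, 0.0), (315, 0.5), (360, 0.75),
]


def build_weight_table(control_points):
    """Return 360 weights (one per degree) by piecewise-linear interpolation."""
    if control_points[0][0] != 0 or control_points[-1][0] != 360:
        raise ValueError("control points must span 0..360 degrees")
    table = []
    for deg in range(360):
        for (a0, w0), (a1, w1) in zip(control_points, control_points[1:]):
            if a0 <= deg <= a1:
                t = (deg - a0) / (a1 - a0)
                table.append(w0 + t * (w1 - w0))
                break
    return table
```

A lookup such as `build_weight_table(PLACEHOLDER_POINTS)[round(angle) % 360]` would then return the weight for a given FV angle.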
(5) Determination of the final evaluation decision, based on the sorted sequence SO_TST(i) with ascending SDP values for the N_TST tested synthesis types. To determine possible similarities in the evaluated synthesis types, the differences Dso between the sorted SO_TST values are calculated. Small Dso values below the threshold D_THRESH indicate the "similarity" result. The final evaluation order of the three types of the synthesis method tested is then determined from these differences, where Dso_X-Y represents the difference between the Xth and the Yth rank in the order of sorted SO_TST values.
The Dso can theoretically reach up to 200% for SO_TST values in quadrants EQ1/EQ2 with opposite importance weights 1/−1 (see Figure 1). The first rank ("1") denotes the maximum proximity of the tested synthesis to the original and the last rank ("3" for N_TST = 3) represents the maximum difference between the synthesis and the original. The similarities between two or more consecutive ranks are denoted as "1/2", "2/3", etc. A possible notation of the final result obtained is FEO(Synt1, Synt2, Synt3) = {"2", "1", "3"} for well-differentiated SO_TST values or FEO(Synt1, Synt2, Synt3) = {"1/2", "1/2", "3"} for detected similarity between the first and the second evaluated synthesis types. In the first case, Synt2 is the best and Synt3 is the worst. The second example means that Synt1 and Synt2 are similar, and Synt3 is the worst. The visualization of the processing of sum vectors to obtain the FEO decision for two types of synthesis is shown in Figure 6.
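The ranking with similarity detection can be sketched as follows. The sketch assumes that the SDP values are expressed in percent, that Dso is the absolute difference between neighboring values in the sorted order, and that a run of similar neighbors shares one combined label; the function name and the input values in the usage note are illustrative only.

```python
def final_evaluation_order(sdp, d_thresh=5.0):
    """Rank synthesis types by ascending SDP and merge similar neighbors.

    sdp: dict mapping a synthesis name to its SDP value (assumed in percent).
    Returns a dict mapping each name to a rank label such as "1" or "1/2".
    """
    if not sdp:
        return {}
    names = sorted(sdp, key=sdp.get)  # ascending SDP values
    groups = [[1]]
    for i in range(1, len(names)):
        dso = abs(sdp[names[i]] - sdp[names[i - 1]])
        if dso < d_thresh:
            groups[-1].append(i + 1)  # similar to the previous rank
        else:
            groups.append([i + 1])
    labels = {}
    idx = 0
    for group in groups:
        label = "/".join(str(rank) for rank in group)
        for _ in group:
            labels[names[idx]] = label
            idx += 1
    return labels
```

For instance, `final_evaluation_order({"Synt1": 20.0, "Synt2": 22.0, "Synt3": 90.0})` labels Synt1 and Synt2 as "1/2" and Synt3 as "3", which corresponds to the second notation example above.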

Material Used, Initial Settings, and Conditions
To evaluate synthetic speech quality by continual classification in the P-A scale, we collected the first speech corpus (SC1) consisting of three parts: the original speech uttered by real speakers and two variations of speech synthesis produced by the Czech TTS system using the USEL method [16] with voices based on the original speakers. Two methods of prosody manipulation were applied: the rule-based method (denoted TTS_A) and the modified version reflecting the final syllable status (denoted TTS_B) [11]. The natural as well as the synthetic speech originates from four professional speakers: two males (M1, M2) and two females (F1, F2). Fifty declarative sentences were used for each of the four original speakers (50 + 50 for the male and 50 + 50 for the female voices, 200 in total). As regards the synthesis, we used 2 × 50/40 (for M1/M2) and 2 × 40/40 (for F1/F2) sentences of two synthesis types from each of the four voices, 340 in total for all the voices. The speech signals were sampled at 16 kHz and their duration ranged from 2.5 to 5 s.
The second collected speech corpus (SC2) consists of four parts: the natural speech uttered by the original speakers and three variations of speech synthesis, namely the USEL-based TTS system (denoted Synt1) and two LSTM-based systems with different vocoders: the conventional WORLD (further referred to as Synt2) [16] and WaveRNN (referred to as Synt3) [17]. As in the case of SC1, the original and synthetic speech originated from the speakers M1, M2, F1, and F2. This means that 200 original sentences and 600 synthetic ones (200 for each of the synthesis types) were used in this work. The processed synthetic speech signals with the duration from 2 to 12 s were resampled at 16 kHz. The detailed description of the speech material used is provided in Table 1.
To create and train the GMM models of the Pleasure/Arousal classes, two separate databases were used. The first was the International Affective Digitized Sounds (IADS-2) [28] database (further referred to as DB1). It consists of 167 sound and noise records produced by humans, animals, simple instruments, the industrial environment, weather, music, etc. Every sound was repeatedly evaluated by listeners, so the database contains the mean values of Pleasure and Arousal parameters within the range from 1 to 9. All the sound records used, with the duration of 6 s, were resampled at 16 kHz to comply with the tested as well as original speech signals. In this case, the GMM models are common for male and female voices. The second database used was the MSP-IMPROV audiovisual database [29] in the English language (further referred to as DB2). From this database, we used only declarative sentences in four emotional states (angry, sad, neutral, and happy) uttered by three male and three female speakers. Finally, 2 × 240 sentences (separately for male and female voices) with the duration from 0.5 to 6.5 s were used. For compatibility with DB1, all of the applied speech signals were resampled at 16 kHz and the mean P-A values were recalculated to fit the DB1 range from 1 to 9. These two databases were used because they contain all the records with evaluation results on the P-A scale and were freely accessible without any fee or other restrictions.
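The recalculation of P-A values to the range from 1 to 9 can be illustrated by a linear mapping. The text does not state which mapping was applied or what the source range of DB2 is, so both the linear form and the source range of 1 to 5 used in the usage note are assumptions, and the function name is illustrative.

```python
def rescale(value, src_min, src_max, dst_min=1.0, dst_max=9.0):
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must not be empty")
    scale = (dst_max - dst_min) / (src_max - src_min)
    return dst_min + (value - src_min) * scale
```

Under these assumptions, `rescale(3.0, 1.0, 5.0)` maps the middle of a five-point scale to 5.0, the middle of the target range.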
The speech/sound signal analyzed is processed by a pitch-asynchronous method frame by frame with one-half overlap. The frame length of 24/20 ms was used for male/female voices according to the F0 values of the current speaker (see the second column in Table 1). For the calculation of spectral and cepstral properties, the number of fast Fourier transform (FFT) points was set to N_FFT = 1024. A detailed list of the speech features used, grouped by type, is shown in Table 2.
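The framing parameters translate into sample counts as in the following sketch. It assumes that only complete frames are counted and that no padding is applied at the end of the signal; the function name is illustrative.

```python
def frame_layout(n_samples, fs_hz=16000, frame_ms=24.0, overlap=0.5):
    """Return (frame length, hop size, number of complete frames) in samples."""
    frame_len = round(fs_hz * frame_ms / 1000.0)
    hop = round(frame_len * (1.0 - overlap))
    if n_samples < frame_len:
        return frame_len, hop, 0
    n_frames = 1 + (n_samples - frame_len) // hop
    return frame_len, hop, n_frames
```

At 16 kHz, a 24 ms frame contains 384 samples and a 20 ms frame contains 320 samples, so both fit within the 1024-point FFT. A 2.5 s sentence (40,000 samples) gives 207 frames for the male setting and 249 frames for the female setting under the assumptions above.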
From these four types of features, four feature sets P0, P2, P4, and P42 were constructed for application in the GMM building part, as well as for classification in the main evaluation process. In correspondence with [10], all input feature vectors consisted of N_FEAT = 16 representative statistical parameters of speech features (see Table 3).
The number of P-A classes was reduced to N_PC = 7 and N_AC = 5 so that the data of both tested databases were approximately evenly distributed. The similarity threshold D_THRESH for FEO determination was empirically set to 5%. The values of importance weights together with the angles of the central and border definition points for the functions IW_EQ1-4 are shown in Table 4. Finally, the transformation curves were constructed using linear interpolation, as demonstrated graphically in Figure 7.
In the GMM-based creation, training, and classification process, a diagonal covariance matrix was selected due to its lower computational complexity. These program procedures were realized with the help of the "Netlab" pattern analysis toolbox [30], and the whole proposed automatic evaluation system was implemented in the Matlab computing system (ver. 2016b). The computational complexity was investigated using a Lenovo Yoga UltraBook with an Intel i5-4200U processor operating at 2.30 GHz, 8 GB RAM, and Windows 10.

Experiments Performed and the Results Obtained
Experiments in this research were realized in two steps. An auxiliary analysis had to be performed before the main evaluation. The first part of the preliminary investigations was motivated by seeking an appropriate setting of control parameters for the GMM-based classification process. The positions of the originals in the P-A space were analyzed statistically using the class centers [C_PO, C_AO] and their dispersions represented by the std values std_PO and std_AO. As the originals were the same for both testing speech corpora SC1 and SC2, the results obtained are applicable in all our subsequent evaluation experiments. The second part focused on the functionality testing of the whole evaluation process. These investigations were performed using the speech corpus SC2 and three types of synthesis methods (Synt1, Synt2, and Synt3).
The first part of the auxiliary experiments consists of the following three investigation areas:


1. Comparison of computational complexity expressed by CPU times of GMM creation and training and CPU times of GMM 2D classification of originals in the P-A space for N_MIX = {8, 16, 32, 64, 128, 256, 512} and for both databases (DB1 and DB2); the results obtained are presented numerically in Tables 5 and 6.

2. Mapping of the effect of the number of Gaussian mixtures on the obtained std_PO and std_AO values of originals; see the summary comparison for both databases with the voices M1 and F1, using the feature set P4, in Figure 8.

In the second part of the preliminary investigations, we tested the setting of other parameters with a possible influence on the stability of the partial results and the final decision of the main evaluation experiments. We analyzed and compared several values obtained from the sum vectors: magnitudes and angles, SDPs after weighting in agreement with the localized emotion quadrants, order differences Dso, and final decisions FEO. For these values, we analyzed the influence of: the type of the database (DB1/DB2) used for training of the GMMs in the case of comparison of two methods of prosody manipulation in the TTS system (TTS1/TTS2); see the numerical comparison of partial evaluation parameters as well as the FEO decisions using N_MIX = 128 and the feature set P4 for the M1 voice in Table 7 and for the F1 voice in Table 8.
The main evaluation consists of a summary comparison between the objective results of the proposed system and the subjective results achieved using the standard listening test method. In these final experiments, the sentences of the synthetic speech extracted from both corpora SC1 and SC2 and all four voices were tested, while the original sentences from the speakers were the same for both corpora. In the case of the sentences from SC1, the GMM-based results were compared with the subjective results of a large three-scale preference listening test. This test compared two versions of the same utterance synthesized by the TTS_A and TTS_B prosody generation methods. The listeners had to choose whether "A sounds better", "A sounds similar to B", or "B sounds better". The evaluation set was formed by 25 pairs of randomly selected synthetic sentences for each of the four synthetic voices, so 100 sentences were compared in total. Twenty-two evaluators (of which seven were speech synthesis experts, six were phoneticians, and nine were naive listeners) participated in this subjective listening test experiment. The evaluation carried out is described in more detail in [11]. The final results of the automatic evaluation system based on the GMM classification in the P-A space are compared visually with the evaluation results of the standard listening tests in the bar graphs in Figure 11.
In the second subjective evaluation (the MUSHRA listening test), multiple audio stimuli were used for the comparison of the synthesis tested with a high-quality reference signal and impaired anchor signals resembling the system's artifacts. Both the reference and the anchor signals were hidden from the listener. The subjective audio quality of the speech recordings was scored according to the continuous quality scale with the range from 0 (poor) to 100 (excellent). For each of the four speakers and each of the 10 sets of utterances, there were four sentences to be scored by the listener. One of them was uttered in high-quality original speech (Orig) and the three remaining ones were synthesized by the methods Synt1, Synt2, and Synt3. This test, consisting of the same utterances for every listener, was undertaken by 18 listeners, with 8 of them having experience in speech synthesis [17]. The graphical comparison of the GMM-based evaluation results with the subjective results of the MUSHRA listening test can be found in Figure 12.
The listening test evaluations were carried out previously between the years 2017 and 2019 for different research purposes [11,17]. In both of the tests, the order of the utterances was randomized in each of the ten sets so that the synthesis method was not known to the listener in advance. Every audio stimulus could be replayed before the listener selected a rating. Headphones and quiet ambience were recommended for listening. Neither the gender nor the age of the listener was important in the subjective evaluation, but a background in speech synthesis played an essential role.

Discussion of the Obtained Results
A detailed comparison of computational complexity shows a large increase in CPU time for GMM creation and training when a higher number of mixtures N MIX is used. To obtain GMMs for seven Pleasure or five Arousal classes using the sound database (IADS-2), the necessary CPU time was 43 s for eight mixture components and about 1890 s for N MIX = 512 (see the first two rows in Table 5), representing a 44-fold increase. With the speech database (MSP-IMPROV), separate models for male and female voices were created, hence the CPU times are roughly halved: about 19 s for N MIX = 8 and 900 s for the maximum of 512 mixtures (an approximately 47-fold increase). The situation is similar for both the male and the female voices. For 2D GMM classification of the original sentences of real speakers (a set of 50 in total) with these models, the CPU times are about 7 times lower; however, 250 s for the maximum N MIX = 512 is still too high for real-time processing. In the classification phase, the CPU times are affected neither by the voice (male/female) nor by the database (DB 1 /DB 2 ), as documented in Table 6.
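The fold increases quoted for GMM creation and training follow directly from the reported CPU times. A minimal Python sketch of that arithmetic is given here; only the timing figures come from the text, while the dictionary keys and the function name are illustrative and not part of the original system.

```python
# CPU times (seconds) for GMM creation and training, as quoted in the text.
# Keys and names are illustrative; only the numbers come from the paper.
TRAIN_CPU_S = {
    "DB1 (IADS-2)": {8: 43.0, 512: 1890.0},
    "DB2 (MSP-IMPROV), per gender": {8: 19.0, 512: 900.0},
}


def fold_increase(times, low=8, high=512):
    """Ratio of CPU time at `high` mixtures to CPU time at `low` mixtures."""
    return times[high] / times[low]


for name, times in TRAIN_CPU_S.items():
    print(f"{name}: {fold_increase(times):.1f}-fold increase")
```

Both ratios round to the values stated in the text (about 44 and about 47), so a 64-fold increase in the number of mixtures costs somewhat less than a 64-fold increase in training time for these data.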
The analysis of the effect of the number of Gaussian mixtures on the dispersion of the originals' centers, expressed by the std PO and std AO values, has shown a monotonic decrease (see the graphs in Figure 8). The falling trend is the same for the male (M1) and the female (F1) voices; greater differences are observed when DB 2 is used. For maximum accuracy of the evaluation results, low std PO and std AO values are necessary. This is practically fulfilled for the sound database with N MIX = 128 and for DB 2 with N MIX = 512. With respect to the CPU times, we finally chose N MIX = 128 as a compromise value for further experiments (with CPU times for GMM classification of about 0.5 s per tested sentence).
The next auxiliary analysis of dispersion of the originals around the centers dealt with different feature sets used for GMM classification. As can be seen in a box-plot comparison in Figure 9, lower mean values of std PO and std AO parameters are achieved with P0 and P4 sets for both databases (DB 1 , DB 2 ). Considering the structure of the feature sets in Table 3, we finally decided to use the set P4 with a more balanced distribution of speech features (prosodic, spectral, and temporal types).
For practical testing of the functionality of the evaluation system, we calculated and compared partial results comprising the centers of originals, M FV and φ FV of the sum vectors, summary distribution parameters, differences Dso X-Y, and FEO decisions for the M1 and F1 voices depending on the sound/speech database used (see Tables 7 and 8). The M FV parameters in the second columns of both tables show similar values for both types of prosody manipulation. For better discrimination between them, the emotional quadrant importance weights are applied. In principle, this increases the complexity of the whole evaluation algorithm. On the other hand, considering the location in the emotional quadrants EQ 1-4 is justified by the psychological perception of synthetic speech by human listeners. This perception is the main criterion for evaluating synthetic speech quality, primarily in listening test methods; however, objective evaluation approaches must respect this influence, too. The importance weights nw0,1,2 chosen for the transformation functions IW EQ1-4 (see Table 4) and the subsequent scaling of the M FV values provide the required effect: greater separation of these parameters. This is well documented in the case of DB 2 with the M1 voice (see the last two rows in Table 7), where the simple difference between the M FV values of TTS 1 and TTS 2 is about 0.03, but the sum vectors lie in opposite quadrants (EQ 1 /EQ 3 ), so the SDP values have opposite signs and the value of 193% is finally assigned to the parameter Dso. The same effect is also shown for the female voice F1; in this case the Dso values are smaller, but still safely over the chosen 5% similarity threshold, as documented by the results in the penultimate column of Table 8.
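The way opposite quadrant signs inflate the final difference can be illustrated with a small sketch. The exact definitions of SDP and Dso are given elsewhere in the paper and are not reproduced here. The formula below (absolute difference normalized by the larger magnitude) is only an assumption, chosen because it lets values of opposite sign approach 200%. The input values are invented for illustration and are not taken from Tables 7 and 8. Only the 5% similarity threshold comes from the text.

```python
SIMILARITY_THRESHOLD_PCT = 5.0  # similarity threshold quoted in the text


def dso_percent(sdp_x, sdp_y):
    """Hypothetical Dso: |difference| relative to the larger magnitude, in %.

    This normalization is an assumption made for illustration; it is not
    the paper's definition. Same-sign inputs give 0 to 100 %, opposite-sign
    inputs give 100 to 200 %.
    """
    scale = max(abs(sdp_x), abs(sdp_y))
    if scale == 0.0:
        return 0.0
    return 100.0 * abs(sdp_x - sdp_y) / scale


def is_similar(sdp_x, sdp_y):
    """True when the hypothetical Dso falls below the similarity threshold."""
    return dso_percent(sdp_x, sdp_y) < SIMILARITY_THRESHOLD_PCT


# Invented magnitudes differing by 0.03: same quadrant sign, then opposite signs.
print(dso_percent(0.40, 0.37), is_similar(0.40, 0.37))
print(dso_percent(0.40, -0.37), is_similar(0.40, -0.37))
```

Under this assumed normalization, two values of almost equal magnitude differ by only a few percent when they share a sign, but by nearly 200% when the quadrant weighting gives them opposite signs, which is the qualitative effect described for the M1 voice with DB 2.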
It follows from the last auxiliary investigation that a minimum of 25 sentences (one half of a full set) must be processed to obtain proper partial as well as final evaluation parameters. The values in Table 9 demonstrate that for a lower number of sentences the final decision would not be stable, giving either the wrong evaluation order (for N TS = 10) or no useful information because the similarity category "1/2" is assigned (for N TS = 25). For compatibility between the evaluations using both tested synthetic speech corpora (SC1 and SC2), only the full sets consisting of 50 sentences for each voice were used in further analysis.
The final comparison of the evaluation experiment using sentences of the speech corpus SC1 with the results obtained by the standard listening test method described in more detail in [11] shows principal correspondence, as documented by the graphs in Figure 11. While the results for the M1, F1, and M2 voices are stable and prefer the TTS 2 method for both databases, for the F2 voice TTS 1 and TTS 2 are classified as similar. As follows from the comparison of the center positions of originals and their dispersions in the P-A scale presented in Figure 10, the std PO and std AO parameters reach their greatest values for the F2 voice. This voice also has the lowest evaluation percentage in the listening test (about 53% vs. 65% for the best evaluated voice F1), as shown in Figure 11c.
The final objective results of the second evaluation, based on testing sentences of the speech corpus SC2, show some differences when compared with the MUSHRA listening test. The graphs in Figure 12a,b document that our GMM-based automatic system marks the synthesis Synt2 (LSTM with the WORLD vocoder) as the worst one in all cases, the synthesis Synt1 (USEL) as the best (excluding the F2 voice), and Synt3 (WaveRNN) as being of medium quality. For the female voice F2, the results differ depending on the training database used for the GMMs. For the sound database DB 1 , the quality order is exchanged for the Synt1 and Synt3 types (Synt3 is the best and Synt1 is medium). Using the speech database DB 2 yields a result of similarity between the Synt1 and Synt3 synthesis types. Generally, it can be said that using the speech database DB 2 produces smaller dispersion of the localized positions and hence brings better evaluation results for the Dso parameters and stable FEO decisions.
In contrast, the listening tests rated Synt3 as the best, Synt1 as medium, and Synt2 as the worst (see the 3D bar-graph in Figure 12c). They also indicate similarity between the Synt1 and Synt2 types for the female voice F2 (MUSHRA scores are 48.5% vs. 48.9% [17]). The speech features used for our GMM-based evaluation apparently better reflect the naturalness of the USEL synthesis, which uses units of original speech recordings, although it causes undesirable artifacts due to the concatenation of these units [19]. From this point of view, the DNN is less natural as it uses a model to generate the synthetic speech, but the WaveRNN based on a DNN vocoder is more natural as it uses a complex network for direct mapping between the parametric representation and the speech samples. This is probably the reason why the simpler LSTM with the WORLD vocoder sounds more averaged, smoothed, and less natural. The result of Synt3 being better than Synt2 was expected, too. The listening test comparison of the WaveRNN and the USEL is generally more subjective.

Conclusions
The task of determining synthetic speech quality by objective measures has been successfully fulfilled by the designed automatic system with continual evaluation on the 2D P-A scale and practical verification on two corpora of synthetic speech generated by the Czech TTS system. Theory indicates which type of synthesis (prosody manipulation in the TTS system) should be better; the subjective evaluation can nevertheless reveal a different opinion among listeners, even though the results of the objective evaluation by the proposed system generally correspond with the theory. The benefit of the proposed method is that the sound/speech material used to create and train the GMMs for P-A classification can be totally unrelated to the synthetic speech tested. The sentences from the original speaker also need not be syntactically or semantically related to the sentences of the TTS system evaluated.
The currently developed automatic evaluation system uses a statistical approach and its kernel is based on the GMM classifier. The GMM can describe the distribution of given data using a simple k-means method for data clustering implemented in the Netlab toolbox [30]. We assume that all components have Gaussian distributions, while their linear combination can approximate non-Gaussian probability distributions for each of the processed P-A classes. In addition, we use a fixed number of mixtures for the GMMs without discrimination between the Pleasure/Arousal types of classes and the gender of a speaker (male/female). At present, we are not able to confirm the assumption about the real distribution of the processed data, so the statistical parameters of the training data represented by values in the feature vectors must be investigated in detail. The newer, more complex, and more precise method based on spectral clustering [26] can solve this potential problem, so we will try to implement this approach in the GMM creation and training algorithm. Last, but not least, we would like to test adaptive setting of the training procedure (N MIX , N ITER , and N FEAT parameters) depending on the currently used training data, also reflecting the language characteristics (differences in time duration as well as prosodic parameters).
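To make the classification principle concrete, the following minimal sketch scores a feature vector against one diagonal-covariance GMM per class and selects the class with the highest log-likelihood. It is written in plain Python rather than with the Netlab toolbox used in the paper. The class labels, the mixture parameters, and all helper names are hypothetical, and training (k-means initialization followed by iterative re-estimation) is omitted, so this shows only the per-class scoring step and not the complete 2D P-A evaluation.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gauss_diag(x, mean, var)
        for w, mean, var in gmm
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify(x, class_gmms):
    """Return the label of the class whose GMM gives the highest log-likelihood."""
    return max(class_gmms, key=lambda label: gmm_log_likelihood(x, class_gmms[label]))


# Hypothetical two-component models for two Pleasure classes over 2D features.
# Each component is (weight, mean, variance); all numbers are invented.
PLEASURE_GMMS = {
    "P_low": [(0.6, [-1.0, 0.0], [0.5, 0.5]), (0.4, [-2.0, 1.0], [1.0, 1.0])],
    "P_high": [(0.5, [1.0, 0.5], [0.5, 0.5]), (0.5, [2.0, -0.5], [1.0, 1.0])],
}

print(classify([-1.2, 0.1], PLEASURE_GMMS))
print(classify([1.5, 0.0], PLEASURE_GMMS))
```

Because each class density is a weighted sum of Gaussians, it can take a multimodal or skewed shape even though every individual component is Gaussian, which is the property the paragraph above relies on.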
The limitation of the present work lies in the fact that both synthetic speech databases evaluated were relatively small, and more sentences must be tested to evaluate the real performance of the proposed automatic system. The second problem is the practical impossibility of direct comparison of our final results with the other subjective evaluation approaches due to an incompatible expression of results (in the case of the MUSHRA test) or the absence of percentage values (for comparison with the listening test in the form of a confusion matrix). The output of our automatic evaluation system takes the form of FEO decisions representing symbolic distances in the 2D P-A space between the originals (from a speaker) and the synthetic speech, with the added aspect of subjective emotional meaning given by the location in the four emotional quadrants. Next, the parameters Dso 1-2,2-3 determining the differences between the first and the second rank and between the second and the third rank in the order are expressed in percentages but, due to the application of emotional quadrant weights, they can reach up to 200%.
From the practical point of view, it would be useful to provide an evaluation of the overall computational complexity of the method used in our evaluation process, together with its real-time capabilities, as well as performance testing of the whole automatic evaluation system. The current realization in the Matlab environment is not very suitable for building an application running under Windows or other platforms. If critical points were found, the whole evaluation algorithm would be implemented in one of the higher-level programming languages such as C++, C#, or Java.
Considering the limitations of the current work and its potential for practical use by other researchers, we plan to build larger speech corpora and perform further evaluation experiments with the aim of finding a fusion method that enables comparison with the results obtained from the listening tests. The Czech TTS system tested is also able to produce synthetic speech in the Slovak language (similar to Czech) [16,31]; therefore, we also envisage applying the proposed automatic evaluation system to Slovak. Finally, we will attempt to collect speech databases directly in the Czech and Slovak languages with sentences labeled on the P-A scale for the subsequent creation of GMMs used in the continuous P-A classification.