The Intelligibility Benefits of Modern Computer-Synthesized Speech for Normal-Hearing and Hearing-Impaired Listeners in Non-Ideal Listening Conditions

Abstract

Speech intelligibility is a concern for public health, especially in non-ideal listening conditions where listeners often attend to target speech in the presence of background noise. With advances in technology, synthetic speech has been increasingly used in lieu of actual human voices in human–machine interfaces, such as public announcement systems, answering machines, and virtual personal assistants.


Introduction
Synthetic speech generated by computers has been increasingly used in lieu of actual human voices in human–machine interfaces. Nearly all digital assistants on contemporary mobile devices, personal computers, and consumer electronics, such as Amazon Alexa and Apple Siri, interact with users using synthetic voices. GPS navigators also use synthetic voices to guide their users. However, the information exchange between humans and machines often takes place in non-ideal (e.g., noisy and reverberant) listening conditions. For example, drivers often hear the instructions given by the GPS mixed with surrounding sounds, such as engine, road, and wind noises. Therefore, speech intelligibility in such scenarios is crucial, especially in safety-critical environments. Research into the acoustic traits that underlie speech intelligibility has continued for decades; measuring and improving the intelligibility of synthetic speech has been heavily studied since the emergence of speech synthesizers [1,2]. In the past, it was found that synthetic speech was less intelligible than natural speech in quiet conditions [2] and even less so in non-ideal listening conditions [3,4]. As a public health concern, the intelligibility of synthetic voices in everyday life, especially for the hard-of-hearing population, has drawn considerable attention.
Enormous efforts have been devoted to designing novel algorithms for speech synthesis. The earliest speech synthesizers evolved from rule-based models, which filter a source signal to resemble speech, to data-based models, which concatenate segments from recorded human voices [5,6]. However, speech generated by these synthesizers is highly unnatural, especially in the prosody domain. Statistical parametric synthesis (SPS) was proposed to address these problems [7]. SPS and subsequently developed neural models directly synthesize speech waveforms instead of concatenating units from natural speech. A typical SPS pipeline includes three components: a text analyzer that extracts linguistic features from the text, an acoustic model that translates linguistic features to acoustic features, and a vocoder that outputs a waveform based on the acoustic features. The neural synthesizers that came later attempted to simplify this pipeline and strove toward directly producing waveforms from text. Synthetic voices generated by state-of-the-art neural synthesizers have demonstrated significantly better quality and naturalness compared to traditional SPS speech [7,8].
Concerning speech intelligibility in general, many studies have investigated which acoustic features are crucial to it and why. For clear speech, for instance, the speaking rate and vowel space are negatively and positively correlated with intelligibility, respectively [9,10]. Moreover, the speaking rate also has an impact on vowel space: typically, as the speaking rate increases, speech intelligibility and vowel space decrease [11]. The harmonics-to-noise ratio, a measure of voice quality, is also a strong predictor of speech intelligibility [12-14].
Many studies have also examined intelligibility in noise for both natural and synthetic speech. When speaking in noise, humans spontaneously make articulatory changes to produce speech adapted to the listening environment, known as "Lombard speech" [15]. Lombard speech has been extensively studied, and it is often more intelligible than ordinary speech produced in quiet. Compared to ordinary speech, Lombard speech typically has increased intensity and fundamental frequency (F0), prolonged vowel duration, and an upward shift of energy towards the middle or higher frequencies [16,17]. These acoustic changes are understood to help speech sounds better overcome the masking effect of noise. In the early 1980s, the intelligibility of speech generated by two rule-based synthesizers and that of a natural voice in white noise were compared. Although the intelligibility of some phonemes was statistically similar for all three voices, the advantage of natural speech was evident [2]. A more recent study evaluated the intelligibility of algorithmically modified natural speech and SPS speech in a series of noise conditions. Some of the synthesizers were specifically adapted to produce synthetic speech with the acoustic properties of Lombard speech. Results showed that natural speech with spectral modification and reduced dynamic range could outperform Lombard speech in noise. Synthetic speech was, however, always less intelligible than natural speech in noise [4]. Further investigation showed that enhancing important regions (e.g., formants) in the spectral envelope and using Lombard-adapted duration and excitation for synthesis could improve the intelligibility of synthetic speech [18]. However, it was also noted in [18] that the improvement of intelligibility in noise due to acoustic modifications came at the cost of the naturalness and sound quality of the speaker's voice when perceived in quiet. Nevertheless, speech with better intelligibility tends to be rated higher in quality by listeners in noise [19].
Until recently, the intelligibility of synthetic voices was mostly assessed with normal-hearing (NH) listeners. Only a handful of studies have investigated the intelligibility of synthetic speech for hearing-impaired (HI) listeners. In the early 1990s, one of the earliest studies of this kind suggested that the intelligibility of synthetic speech varied, but some synthesizers could reach intelligibility similar to that of natural speech for HI listeners in quiet environments [20,21]. A more recent study [22] suggested that listeners' degree of hearing loss at high frequencies (>8 kHz) was negatively correlated with the intelligibility of synthetic speech. There is also evidence that, for listeners with cochlear implants, synthetic speech at a slow speaking rate was significantly more intelligible than at a normal rate, which in turn was more intelligible than at a fast rate [23]. After synthesis techniques rapidly advanced for over a decade, the intelligibility of speech from the most recent synthesizers has hardly been reported. It is particularly meaningful to know whether the improvement in the quality and naturalness of synthetic speech seen in recent years could lead to intelligibility gains in non-ideal conditions for both NH and HI listeners.
This study aimed to systematically evaluate the intelligibility of synthetic voices generated by modern commercial speech synthesizers in a series of noise conditions for NH and HI listeners. Six acoustic properties, including speaking rate, F0, harmonic-to-noise ratio, vowel-to-consonant duration ratio, spectral tilt, and vowel space, were measured for the voices and further analyzed to discover and understand how they affected the intelligibility of synthetic speech for the two cohorts of listeners. The main findings and implications are further discussed.

Speech Synthesizers
Three state-of-the-art commercial neural speech synthesizers were chosen to generate the synthetic voices to be evaluated. The Amazon Polly neural text-to-speech (TTS) system first converts sequences of phonemes to spectrograms, chooses the spectrograms in which the energy is distributed more in human-sensitive frequencies, and then translates these spectrograms to sound signals with a neural vocoder [24]. The Azure neural TTS voices are synthesized by a modification of the FastSpeech2 model, in which the acoustic model produces spectrograms based on pitch and duration predictors, and the vocoder converts the spectrograms to signals [25,26]. The Google Cloud TTS system implements the WaveNet2 model and directly produces speech one sample at a time with a convolutional neural network [27,28].
Speech was generated in both male and female voices, but the male voices were chosen for the Google and Microsoft Azure synthesizers because they were predicted by an intelligibility model (see Section 2.4) to be more intelligible than their female counterparts in the same noise environments. Amazon Polly only supported a female voice in Mandarin. For comparison, a female natural voice was also recorded to balance the sex distribution among the four voices, as further detailed in Section 2.2. Table 1 lists information about these four voices.

Speech Materials
The corpus used for generating the synthetic speech was the Mandarin Speech Perception (MSP) test materials [29]. This corpus contains a total of 100 sentences, and each sentence consists of seven monosyllabic characters. The sentences have sensible meanings, and the words used are common in everyday life; hence, they have moderate predictability. The entire corpus is phonemically balanced: the distribution of phonemes and tones in this corpus is comparable to that of commonly used Mandarin Chinese words [29].
Each of the three commercial speech synthesizers was used to generate a synthetic version of the 100 MSP sentences, resulting in three sets of MSP sentences. All the speech signals were synthesized and saved in mono WAV format, with a sample size of 16 bits and a sampling rate of 16 kHz. To set a baseline using a natural human voice, a female native Chinese speaker recorded the 100 sentences in a sound-treated audio booth using an AKG C520 condenser microphone. During recording, the microphone was attached to the speaker's head, and the head of the microphone was placed 2 cm away from the corner of the speaker's mouth. The audio signals captured by the microphone were preamplified by an RME Fireface UCX II audio interface before being saved in WAV format with a sampling rate of 44.1 kHz. When generating stimuli for evaluation, the recordings were downsampled to 16 kHz; the root-mean-square intensity of all WAV files, including the synthetic sounds, was normalized to 0.01 to avoid clipping when writing the files to disk.
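The root-mean-square normalization described above can be sketched in a few lines of numpy; this is a minimal illustration of the idea, not the script used in the study, and the test tone below is a stand-in for the actual WAV files.

```python
import numpy as np

def normalize_rms(signal: np.ndarray, target_rms: float = 0.01) -> np.ndarray:
    """Scale the signal so its root-mean-square level equals target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)

# A 1-s, 440-Hz tone at 16 kHz as a stand-in signal.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
scaled = normalize_rms(tone)
```

A target RMS of 0.01 leaves ample headroom below full scale (±1.0 in floating-point WAV), which is what prevents clipping when the files are written.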

Noise Maskers and Stimuli
Two types of noise were chosen to mix with the speech utterances: speech-shaped noise (SSN) and speech-modulated noise (SMN). The SSN was generated by filtering white noise using the average vocal tract effect of 120 randomly selected utterances from the four sets of sounds. A 100th-order linear predictive coding model was used to estimate the vocal tract effect. The SSN consequently shared the long-term average spectrum of the combined corpora, which maximized the masking effect on the target speech in the frequency domain. To generate the SMN, the speech envelope was first extracted and smoothed by convolving a 10-min utterance with a 7.2-ms-long pulse train; the SMN was then the SSN scaled by this speech envelope. In the time domain, while SSN is temporally stationary, the waveform of SMN resembles that of an actual speech signal with temporal modulations, as illustrated in the left panel of Figure 1. Despite their identical spectra, as shown in the right panel of Figure 1, SSN and SMN are known to have different masking effects on the target speech even at the same speech-to-noise ratios (SNRs). The temporal envelopes of the noise maskers are also displayed in red, offset from the waveforms for better illustration.
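The masker construction can be sketched as follows. This is a simplified illustration under stated assumptions: the "speech" is a synthetic stand-in, the 100th-order LPC filter of the study is replaced by FFT-magnitude matching (which also imposes the speech's long-term spectrum), and the envelope is smoothed with a single 7.2-ms rectangular window rather than the pulse-train convolution described above.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000

# Stand-in "speech": amplitude-modulated noise (the study used utterances
# from the four corpora; this signal is purely illustrative).
t = np.arange(2 * fs)
speech = rng.standard_normal(t.size) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t / fs))

# SSN: white noise shaped to the long-term magnitude spectrum of the speech
# (a simpler stand-in for the 100th-order LPC filter used in the study).
white = np.fft.rfft(rng.standard_normal(t.size))
ssn = np.fft.irfft(white / (np.abs(white) + 1e-12) * np.abs(np.fft.rfft(speech)),
                   n=t.size)

# Envelope: rectified speech smoothed over a 7.2-ms window, then used to
# modulate the stationary SSN sample by sample, yielding the SMN.
win = np.ones(int(0.0072 * fs))
env = np.convolve(np.abs(speech), win / win.sum(), mode="same")
smn = ssn * (env / env.max())
```

Because the SMN is the SSN rescaled sample by sample, the two maskers share a long-term spectrum while differing in temporal modulation, which is exactly the property the experiment exploits.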
Target speech signals were mixed with each noise masker at three SNRs to create the stimuli. As the two maskers differ in their masking effect, as mentioned above, the chosen SNRs were −9, −4, and 1 dB for SSN and −12, −7, and −2 dB for SMN, determined empirically and referred to as "low", "mid", and "high" hereafter. These three SNRs were expected to lead to listener intelligibility of approximately 25%, 50%, and 75%, respectively, in each masker [4,30].
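Mixing speech and masker at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target before adding the two signals. A minimal sketch, with random stand-in signals rather than the actual stimuli:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(1)
s = rng.standard_normal(16000)
n = rng.standard_normal(16000)
mixed = mix_at_snr(s, n, -9.0)  # the "low" SNR condition for SSN
```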

Model Predictions
The intelligibility of the voices in the noise conditions was first estimated using an objective intelligibility model, the Distortion-Weighted Glimpse Proportion (DWGP, [31]), as a first approximation. The target speech and its masking counterpart are taken as the inputs for DWGP to generate auditory spectro-temporal (S-T) representations of the signals. Auditory analyses are performed by the model to determine the number of S-T regions in the speech signal with a local SNR above a 3-dB threshold, known as "glimpses", at 34 frequencies that simulate auditory filtering. For each frequency band, the similarity between the clean target speech and the speech-plus-noise mixture is also measured. The final predictive index is a logarithmically compressed linear combination of the glimpse quantities weighted by the similarity scores in their frequency bands. Ranging from 0 to 1, greater DWGP scores predict better intelligibility. Figure 2 displays the DWGP score of the estimated intelligibility as a function of SNR for the four voices in the chosen noise maskers. In Figure 2, each data point is an average of DWGP scores from the 100 MSP sentences in the same condition. The model predicted that "Amazon" would perform consistently better than the other three voices at all SNRs tested in both SSN and SMN. "Google" and "Microsoft" were predicted to have similar intelligibility; "baseline", i.e., the natural voice, was predicted to be the least intelligible.
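The glimpse-counting idea at the core of the model can be illustrated with a toy spectro-temporal grid. This sketch deliberately omits what makes DWGP DWGP, namely the 34-channel auditory filterbank and the distortion (similarity) weighting, and uses plain rectangular FFT frames instead; it only shows how cells with local SNR above the 3-dB threshold are counted.

```python
import numpy as np

def glimpse_proportion(speech, noise, frame=160, threshold_db=3.0):
    """Fraction of spectro-temporal cells whose local speech-to-noise ratio
    exceeds threshold_db. A simplified glimpse count only: the full DWGP
    model adds auditory filtering and distortion weighting, omitted here."""
    n_frames = min(speech.size, noise.size) // frame
    S = np.abs(np.fft.rfft(speech[:n_frames * frame].reshape(n_frames, frame), axis=1))
    N = np.abs(np.fft.rfft(noise[:n_frames * frame].reshape(n_frames, frame), axis=1))
    local_snr_db = 20.0 * np.log10((S + 1e-12) / (N + 1e-12))
    return float(np.mean(local_snr_db > threshold_db))

rng = np.random.default_rng(2)
s = rng.standard_normal(8000)
n = rng.standard_normal(8000)
gp_soft = glimpse_proportion(s, n)
gp_loud = glimpse_proportion(4.0 * s, n)  # raising the speech level by 12 dB
```

Raising the speech level shifts every cell's local SNR upward, so more cells clear the threshold and the glimpse proportion grows, mirroring the monotone SNR dependence seen in Figure 2.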

Participants
Ten native Mandarin Chinese speakers (6 females, 4 males, average age = 23.5 years, s.d. = 2.3 years) with self-reported NH were recruited. All the participants were undergraduate or graduate students at the University of Illinois at Urbana-Champaign. The recruitment and testing methods were approved by the Institutional Review Board of the university. All the participants were compensated financially or via course credit for their time. A post-experiment hearing screening was conducted on all the participants using standard clinical audiometric procedures with an audiometer (Interacoustics AS608e). Their hearing levels (HL) were measured as the pure-tone average (PTA6) over six octave frequencies (250, 500, 1000, 2000, 4000, and 8000 Hz). The results showed that one female participant had mild hearing loss with a PTA6 above 20 dB HL, and her data were therefore excluded from further analysis.
Eight native Mandarin Chinese speakers (5 females, 3 males, average age = 57.6 years, s.d. = 6.4 years) with symmetrical sensorineural hearing loss also participated in this experiment in Jiangsu, China. Of the eight participants, four were daily hearing aid users. Their HLs were measured using the same equipment and procedure as for the NH cohort. The PTA6 across the eight symmetrical cases was 36 dB HL for both the left and right ears, with an interaural difference below 3 dB. The mean HLs in PTA6 for the HI cohort are presented in Figure 3.

Procedure
A pool of 2400 stimuli (4 voices × 2 noise maskers × 3 SNRs × 100 sentences) was created. In each stimulus, the clean speech was buffered with 300-ms silences at the beginning and the end before mixing with the noise. This allowed listeners' hearing to adapt to the background noise and reduced the effect of non-simultaneous masking [32]. The actual SNR was calculated only from the region where the target speech was present.
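The buffering-and-mixing step can be sketched as follows; the point of the slice is that the SNR is computed only over the speech-active samples, so the 300-ms silent buffers do not dilute the speech power estimate. The signals are random stand-ins, not the experimental stimuli.

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(3)

speech = rng.standard_normal(fs)             # stand-in 1-s utterance
pad = np.zeros(int(0.3 * fs))                # 300-ms silence buffers
buffered = np.concatenate([pad, speech, pad])
noise = rng.standard_normal(buffered.size)   # masker covers the whole stimulus

# The actual SNR is computed only where the target speech is present.
active = slice(pad.size, pad.size + speech.size)
snr_db = 10.0 * np.log10(np.mean(buffered[active] ** 2)
                         / np.mean(noise[active] ** 2))
stimulus = buffered + noise
```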
Given the 4 voices, 2 noise maskers, and 3 SNRs, this design led to a total of 24 conditions. In each condition, a listener heard 4 random sentences, resulting in 96 unique sentences presented throughout the test. As some sentences may be intrinsically more intelligible than others, sentences were randomly drawn without replacement from the pool for each condition in order to minimize such an effect. The sampling was also validated to make sure that none of the 96 sentences was heard twice by the same listener in different conditions. During the test, sentences were blocked by noise-SNR condition. Within each block, the order of the sentences was randomized; the order of the noise-SNR blocks was also randomized.
For NH listeners, the experiments took place in the same sound-treated audio booth where the natural voice was recorded, as described in Section 2.2. The stimuli were presented to listeners over a pair of Sennheiser HD 660S open-back headphones, preamplified by the same audio interface used in the recording. The presentation level of speech was calibrated and fixed at approximately 69 dB SPL, the normal conversational level [33]; the noise level was then adjusted to meet the desired SNRs. Stimulus playback was administered by a computer program. After a stimulus was played, participants were instructed to type what they heard from the sentence in Chinese characters using a computer keyboard. Listeners were not allowed to replay any stimulus. Responses were recorded and saved automatically as the experiment progressed. Participants were able to complete the test within 40 min.
Although the HI listeners were recruited at a different site, the experiments were conducted in an audio booth with acoustic specifications similar to those of the booth used for the NH listeners. All the experiment materials, design, setup, and equipment strictly followed those for the NH listeners, except for the following. All HI listeners wearing hearing aids were instructed to remove their hearing devices and were tested with bare ears. Instead of using a fixed presentation level for the target speech, all participants in this group were allowed to adjust the speech intensity to a comfortable level (average level = 76.1 dB SPL, s.d. = 5.7 dB) before the experiment started; they could not adjust the level during the test. Because the participants in this group were significantly older than those in the NH group, and some were unable to use computers, they responded to the stimuli by orally repeating what they heard after each sentence, which was recorded simultaneously by a digital audio recorder.
The intelligibility of the four voices was measured as the listener character recognition rate (CRR) in percentages. As Chinese characters have no predictable mapping to their pronunciations, it is possible for a participant to hear a sound correctly but type a character with the same pronunciation that mismatches the intended meaning. Thus, both responses and answer keys were converted to Hanyu Pinyin (the phonetic symbols for Chinese characters) before comparison, in order to take homophones into account. The correctness of the lexical tones in Mandarin Chinese was not taken into account, so syllables with the same phonemes as the answer but different tones were still considered correct. Responses of NH listeners were scored automatically using a computer script, while a manual transcribing and scoring procedure had to be used for HI listeners because their responses were saved as audio.
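The scoring rule just described, comparing tone-less Pinyin position by position so that homophones count as correct, can be sketched as below. The character-to-Pinyin conversion itself is assumed to happen upstream (e.g. with a library such as pypinyin), and the key and response syllables here are hypothetical examples, not sentences from the MSP corpus.

```python
def crr(response, key):
    """Character recognition rate (%) over tone-less Pinyin syllables,
    matched position by position against the answer key."""
    correct = sum(r == k for r, k in zip(response, key))
    return 100.0 * correct / len(key)

# Hypothetical 7-syllable answer key and a response with two errors.
key = ["wo", "men", "ming", "tian", "qu", "gong", "yuan"]
resp = ["wo", "men", "min", "tian", "qu", "gong", "quan"]
score = crr(resp, key)  # 5 of 7 syllables correct
```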

Results
Figures 4 and 5 show the average CRRs of participants for the four voices in different noise conditions for NH and HI listeners, respectively. For NH listeners, in SSN, "baseline" appears to be the least intelligible at the low and mid SNRs, with CRRs of 4% and 38%, respectively. This is broadly consistent with the DWGP predictions in Section 2.4. "Amazon", on the other hand, leads to noticeably higher intelligibility (38%) than "Microsoft" (13%) and "Google" (8%) at the low SNR. "Amazon" (85%) is also more intelligible than "Microsoft" (55%) at the mid SNR, but it no longer outperforms "Google" (86%). At the high SNR, all voices lead to CRRs of no lower than 83%, with "Microsoft" being the worst (83%) and "Google" being the best (98%). In SMN, which has a much stronger fluctuation in amplitude than SSN, "Microsoft", rather than "baseline", is the least intelligible at all SNRs. The advantage of "Amazon" (54%) over the other voices ("Google" (47%) and "baseline" (33%)) at the low SNR is less evident than in SSN; the same pattern can also be observed at the mid SNR.

For HI listeners, the intelligibility of all voices is unsurprisingly lower than that of their NH counterparts in all conditions, with average decreases of 37.4 percentage points (ppts) in SSN and 38.3 ppts in SMN across SNRs and voices. The variance in performance also tends to be greater in several conditions, as indicated by the extended error bars in Figure 5. Interestingly, HI listeners did not benefit from "Amazon" or "Google" as much as their NH counterparts did at the low and mid SNRs in both noise maskers. Compared to "baseline" in SSN, the CRR gain of "Amazon" is merely 2 and −1 ppts for HI listeners at the two lower SNRs, respectively, down from 34 and 47 ppts for the NH cohort in the corresponding conditions. The advantage of the synthetic voices over "baseline" further diminishes, with the largest intelligibility gain being only 8 ppts, achieved by "Google" at the mid SNR.

Listener performance in CRR of both the NH and HI groups was further analyzed together using a multi-level linear mixed-effects model (LMM) to account for the within-group and between-group variability. The LMM was implemented using the "lme4" package [34] in R [35], with CRR as the dependent variable and subject variability as the random effect. The four independent variables, group ("NH" and "HI"), noise type ("SSN" and "SMN"), SNR ("low", "mid", and "high"), and voice ("Amazon", "Google", "Microsoft", and "baseline"), and their interactions were included as fixed effects. Since all independent variables were categorical, a dummy coding scheme was employed, with group: "NH", noise type: "SSN", SNR: "low", and voice: "baseline" being the reference level of each variable.
A series of linear models was built using a step-wise sequential approach in the forward direction, where the independent variables and their interactions were added incrementally, starting with only the random effect. Table 2 shows the models whose fit significantly improved over the previous models. The results suggested that all independent variables (group, noise type, SNR, and voice) had a significant impact on listeners' CRR performance [∀χ² ≥ 10.399, ∀p < 0.01]. Most of the two-way and three-way interactions between the independent variables were also significant [χ² ≥ 9.464, p < 0.01], and so was their four-way interaction [χ²(50) = 3.213, p < 0.001]. The final LMM, therefore, included 12 terms, as shown in Table 2.

Table 2.
Step-wise build-up of the LMM. The models were assessed by the Akaike information criterion (AIC), Bayesian information criterion (BIC), log-likelihood (LogLik), and the likelihood ratio with the degrees of freedom (χ²(df)) and significance level (SL; ***: p < 0.001; **: p < 0.01). Only significant factors and interactions are listed.

The significant four-way interaction, i.e., group × noise type × SNR × voice, indicates that the voices varied in intelligibility at different SNRs in the two noise maskers between NH and HI listeners. Contrasts were used to break down this interaction in order to gain better insight into the data. All the significant contrasts and their parameters for this interaction are listed in Tables 3 and 4, respectively. Contrast 1 shows that the intelligibility gain of "Amazon" over "baseline" differed at the mid SNR in the two noise maskers for NH and HI listeners. For "Microsoft", while it was more intelligible than "baseline" at the low SNR in SSN, its intelligibility was lower than "baseline" at the same SNR in SMN for the NH cohort; the difference in intelligibility between the two voices was, however, less sizable for HI listeners. A similar relationship was also observed at the mid SNR in the two maskers for "Microsoft", but the overall intelligibility level of this voice was higher than that at the lower SNR, as suggested by Contrasts 2 and 3.
Contrast 4 indicates a similar result for "Google" at the mid SNR in the two noise maskers for the two cohorts of listeners as Contrast 1 does for "Amazon". Post hoc pairwise comparisons with the Bonferroni adjustment conformed with the visual impressions received from Figures 4 and 5. In the temporally stationary noise (i.e., SSN), at the low and mid SNRs, "Amazon" is substantially more intelligible than "baseline" and "Microsoft" for NH listeners [∀p < 0.001]. "Google" is also more intelligible than "baseline" and "Microsoft" for NH listeners [∀p < 0.001] at the mid SNR. In the fluctuating noise (i.e., SMN), at the low SNR, "Microsoft" is the least intelligible of the voices [∀p < 0.001] for NH listeners, but not at the mid SNR, where its intelligibility is comparable to "baseline" [p = 1.000] but worse than "Google" [p < 0.01] and "Amazon" [p < 0.001]. In both noise maskers, all the voices are similarly intelligible at the high SNR [∀p = 1.000] for NH listeners, where intelligibility reaches the ceiling. For HI listeners, performance shows that no voice was significantly more intelligible than the others at any SNR [∀p = 1.000].

Acoustic Analyses
The acoustics of the speech signals were analyzed to explain the listeners' different performance when perceiving the four voices. The speaking rate, fundamental frequency, harmonic-to-noise ratio, vowel-consonant duration ratio, spectral tilt, and vowel space were analyzed for the sentences in each voice. The Montreal Forced Aligner [36] was used to generate phoneme- and word-level speech segments from the WAV files, followed by further manual checking and adjustment of the segment boundaries.
- Speaking rate. The speaking rate was measured as the number of spoken syllables per second. As each sentence in this corpus consists of seven words (syllables), the speaking rate for a sentence was calculated by dividing seven by the duration of speech in seconds. Measurements for the four voices are shown in Figure 6a. Pairwise comparisons using the Bonferroni method suggest that the voices significantly differ in speaking rate [∀p < 0.001], with the difference between "baseline" and "Amazon" being the smallest [p < 0.05].
- Fundamental frequency (F0). The F0 for each sentence was calculated as the mean F0 over all vowels in the sentence. As Mandarin is a tonal language, the F0 of a syllable can fluctuate greatly and take a sharp upturn, downturn, or both. In order to take tone differences into account, F0s were measured at the beginning, middle, and end of each vowel; the F0 for the vowel was calculated as the mean of the F0s at these three points. Since every sentence contains seven monosyllabic vowels, the F0s of the vowels were further averaged and taken as the F0 for the sentence. As shown in Figure 6b, all the voices differ from each other in F0 [∀p < 0.001]. Since "baseline" and "Amazon" are female voices, their mean F0s are significantly higher than those of "Microsoft" and "Google" [∀p < 0.001], which are male voices.
- Harmonic-to-noise ratio (HNR). The HNR is the ratio of periodic to non-periodic components in a voiced segment of speech in decibels (dB). It is usually used to indicate the hoarseness and breathiness of a voice; it is also measured as one aspect of voice quality. For each sentence, the HNR was computed as the mean HNR of the seven monosyllabic vowels. The measurements are shown in Figure 6c. The comparisons suggest that all the voices also differed in HNR [∀p < 0.001]. Interestingly, "baseline", the natural voice, had a significantly better HNR than all the synthetic voices.
- Vowel-consonant (VC) ratio. The VC ratio for each sentence was measured by dividing the total duration of all vowels by the total duration of all consonants in the sentence. The statistics are shown in Figure 6d. The VC ratio appears to vary greatly across sentences, as suggested by the expanded confidence intervals. While "Microsoft", "Google", and "Amazon", the three synthetic voices, show no significant difference in VC ratio [∀p > 0.06], "baseline" has a significantly lower VC ratio than "Amazon" [p < 0.01], but not than "Microsoft" [p = 0.074] or "Google" [p = 0.359].
- Spectral tilt. Spectral tilt is one way to quantify how rapidly energy decreases as frequency increases in a speech signal. For the analysis in this study, it was measured as the decrease in energy per unit frequency (dB/Hz); that is, the energy difference between the first formant (F1) and the second formant (F2) over the frequency difference between F1 and F2 on the long-term average spectrum of the signal. It was therefore the slope of the straight line connecting the two data points of F1 and F2 on the spectrum. The flatter the slope, the more energy there is at high frequencies relative to low frequencies. From "baseline" to "Google" in Figure 6e, the spectral tilt increases successively with significant changes [∀p < 0.001]. Consistent with the visual intuition, the spectral tilt of "Amazon" is evidently the lowest among the four voices [∀p < 0.001].
- Vowel space. The vowel spaces and their visualizations for the voices were calculated and drawn using the R package "phonR" [37]. The vowel spaces were calculated by plotting three corner vowels on an F1-F2 chart and finding the smallest polygon that covers the vowels. The three corner vowels are [i], [a], and [u]. These three vowels are frequently used as corner vowels in vowel space drawing [11,38] and are all present in Mandarin as monophthongs. These graphs are shown in Figure 7.
The vowel space areas (VSAs) of "Amazon", "Microsoft", "baseline", and "Google" are 1,365,952, 1,195,746, 1,096,042, and 981,738, respectively, in descending order. A greater VSA may indicate better speech-motor control and more distinctive articulatory positions for different sounds. To further measure the quality of individual vowels and their overlaps, the between- and within-vowel dispersions and their ratios were also measured from the three groups of vowels for the four voices. The greater the between-vowel dispersion, the further the different sounds are from each other and hence the less likely they are to be confused by listeners. Meanwhile, a smaller within-vowel dispersion suggests less variation for the same vowel and hence a more robust and consistent pronunciation of the vowel. As illustrated in Figure 7 and further confirmed by Table 5, the three vowels in "baseline" were the most distinct from each other (the largest between-vowel dispersion) and were pronounced most consistently (the smallest within-vowel dispersion), followed by "Amazon". The between-to-within-vowel dispersion ratio, on the other hand, showed that "Google" had the lowest vowel quality, which is in line with its smallest VSA.
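Computing a VSA from corner-vowel formants reduces to the area of the polygon spanned by the (F1, F2) points, which the shoelace formula gives directly. The formant values below are made up for illustration; the study's measurements came from the phonR analysis.

```python
def polygon_area(points):
    """Area of a polygon from its vertex list via the shoelace formula;
    with (F1, F2) vertices in Hz, the result is in Hz^2, the unit of the
    VSA figures reported above."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical corner-vowel formants (F1, F2) in Hz for [i], [a], [u].
corners = [(300.0, 2300.0), (850.0, 1600.0), (350.0, 800.0)]
vsa = polygon_area(corners)  # 395000.0 for these made-up formants
```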
Two general linear models (GLMs) were fitted on the NH and HI data separately to inspect how well speaking rate, F0, HNR, VC ratio, and spectral tilt, as independent variables, predicted listener performance for the four voices. The vowel space measurements were not included because they were derived at the corpus level for each voice rather than at the sentence level. The values of each predictor were centered to a mean of zero and scaled by their standard deviation. For NH listeners, the predictions made by the GLM had a significant correlation with the intelligibility in CRR [R² = 0.130, F(210) = 6.299, p < 0.001]. The linear coefficients of the predictors, β, and their SLs displayed in Table 6 suggest that speaking rate, F0, and spectral tilt played significant roles in estimating the intelligibility of the voices, but including HNR and VC ratio did not significantly improve the model performance. For HI listeners, however, the chosen predictors together were unable to make predictions significantly correlated with the CRR [R² = 0.010, F(186) = 1.385, p = 0.232], although including F0, HNR, and spectral tilt, but not the others, still significantly improved the model fit, as indicated by their β values in Table 7.
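The preprocessing step, z-scoring each predictor before fitting the linear model, can be sketched with numpy in place of the R models used in the study; the predictor means, spreads, and response here are entirely synthetic, chosen only to echo the sign and rough size of the speaking-rate coefficient reported below.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 200

# Hypothetical per-sentence predictors: speaking rate (syll/s), F0 (Hz),
# HNR (dB), VC ratio, spectral tilt (dB/Hz) — means and spreads are made up.
X = rng.standard_normal((n_obs, 5)) * [0.5, 40.0, 3.0, 0.2, 0.01] \
    + [4.5, 180.0, 15.0, 1.2, -0.02]

# z-score each predictor: center to zero mean, scale by its s.d.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic CRR-like response and an ordinary least-squares fit with an
# intercept column (a numpy stand-in for the R GLMs).
y = 50.0 - 13.6 * Xz[:, 0] + 2.0 * Xz[:, 4] + rng.standard_normal(n_obs)
A = np.column_stack([np.ones(n_obs), Xz])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Standardizing the predictors puts their coefficients on a common scale, which is what makes the β values in Tables 6 and 7 directly comparable across acoustic measures with very different units.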

Discussion
This study examined the intelligibility of speech generated by three commercial synthesizers for NH and HI listeners in different noise conditions. The results, which correlated well with the initial DWGP predictions [R² = 0.748, p < 0.001], were somewhat unexpected but encouraging when listeners' CRR performance was compared to that on natural speech produced by a female speaker. Previously, the intelligibility of natural speech was found to be substantially better than that of synthetic speech in noise for NH listeners [2-4]. Listener performance in our experiments, however, showed that the natural voice was the least intelligible among the four voices evaluated in many conditions, especially in the stationary masker (speech-shaped noise, or SSN), suggesting that, in the last decade, deep learning-based synthetic speech has improved tremendously in terms of quality [7] and intelligibility over speech previously synthesized with concatenation, vocoders, and statistical models. In the temporally fluctuating masker (speech-modulated noise, or SMN), the advantage of the synthetic voices over "baseline" was noticeably reduced at the low and mid SNRs. Speech temporal modulation is known to be important to intelligibility. One possibility is that the interference of the masker's modulation with the target temporal envelope cancelled some benefits from other aspects (see below) and that synthetic modulation could be more susceptible to noise than that of natural speech. "Microsoft" is an example of this, as it appeared to be more intelligible than "baseline" at the low and high SNRs in SSN, but was significantly less intelligible than "baseline" at the low SNR in SMN. Further modulation analysis of the signals may help in understanding this issue better.
For the HI cohort, the correlation between the DWGP estimations and listeners' performance [R² = 0.616, p < 0.001] was much lower than for their NH counterparts. This is not surprising because DWGP makes predictions using parameters, e.g., the outer-ear transfer function and the HL, intended for NH listeners. Notably, HI listeners' speech understanding did not benefit from any of the state-of-the-art synthesizers as NH listeners' did. Nevertheless, it is worth remarking that their intelligibility did not suffer from synthetic speech either, compared to natural speech. This shows that modern speech synthesizers are already capable of producing speech sounds that may replace the natural human voice without compromising intelligibility, at least for listeners with normal hearing or mild hearing loss. Since all hearing aid users were tested without their hearing devices, it is unclear whether they could achieve better listening performance with their hearing aids on. Hearing devices are usually fitted to a listener's hearing profile in order to maximize performance. There is therefore also a possibility that synthetic sounds interact with hearing devices acoustically, e.g., certain frequencies could be overly boosted or attenuated, leaving the processed sounds mismatched to the listener's profile and hence less intelligible.
Further acoustic analyses revealed that speaking rate was one of the significant factors explaining the NH listeners' intelligibility in this study. The GLM suggested that NH listener performance benefited from a lower speaking rate [β = −13.577, p < 0.001] in general, but this was not the case for HI listeners [β = −2.945, p = 0.323]. This finding is consistent with what was observed in [23] for NH listeners. With a lower speaking rate, listeners may have more time to process and decode the information conveyed by the sounds, especially when perceiving speech in noise, which usually demands a higher cognitive load [39]. This may partly explain the benefit of "Amazon" to listener intelligibility (Figures 4 and 6a). Among the four voices, the lowest speaking rate was approximately 4.12 syllables/s, which is consistent with the normal speaking rate of 4.08 syllables/s for native Mandarin Chinese speakers [40]. It appears that the speaking rates tested in this study were too high for HI listeners to benefit from. Cognitive decline associated with aging, e.g., [41], in the HI group (average age = 57.6 yrs) might be another factor contributing to their reduced acuity in speech perception.
F0 alone was not correlated with listener intelligibility for NH [R² < 0.001, p = 0.901] or HI [R² < 0.001, p = 0.880] listeners, but including it in the GLMs improved the model prediction. There is no direct evidence showing that solo F0 changes have a significant impact on intelligibility for NH [42,43] or HI listeners [44]. For Lombard speech, which is produced when a speaker talks in noise and is understood to be more intelligible than speech produced in quiet, increased F0 was indeed observed together with many other acoustic changes, such as reduced spectral tilt and elongated vowel sounds [45]. Therefore, it is possible that F0's contribution to intelligibility only manifests in the presence of other prominent acoustic correlates. Meanwhile, it is well-documented that solely reducing spectral tilt can improve speech intelligibility for NH listeners, e.g., [17,46], which was also observed in the current study [R² = 0.034, p < 0.01], especially for "Amazon". The GLM likewise indicated that intelligibility was proportional to spectral tilt [β = 17.466, p < 0.05]. Despite the significant effect in the GLM for HI listeners, spectral tilt, like F0, was not directly correlated with intelligibility for HI [R² = 0.002, p = 0.551]. In SSN and SMN, whose spectra are similar to those of speech signals, voices with reduced spectral tilt stand a better chance of escaping masking because more of their energy is located at high frequencies. NH listeners can exploit this unmasking at high frequencies for better intelligibility. As shown in Figure 3, the majority of HI participants in this study had more severe hearing loss at high frequencies, which prevented them from taking as much advantage of this cue as NH listeners did.
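Spectral tilt can be quantified in several ways; one common choice, sketched below, is the slope in dB per octave of a straight line fit to the long-term magnitude spectrum on a log-frequency axis (a shallower, less negative slope means relatively more high-frequency energy). This is an illustrative definition under assumed band limits (50 Hz to 8 kHz), not necessarily the exact measure used in this study.

```python
import numpy as np

def spectral_tilt(x, fs, fmin=50.0, fmax=8000.0):
    """Estimate spectral tilt (dB/octave) by fitting a line to the
    magnitude spectrum level against log2 frequency within [fmin, fmax]."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    sel = (freqs >= fmin) & (freqs <= fmax)
    level_db = 20 * np.log10(spec[sel] + 1e-12)   # magnitude in dB
    octaves = np.log2(freqs[sel] / fmin)          # log-frequency axis
    slope, _ = np.polyfit(octaves, level_db, 1)   # slope in dB/octave
    return slope
```

For a signal whose spectrum falls off as 1/f, this estimator should return roughly −6 dB/octave, a useful sanity check when comparing voices.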
VSA and vowel quality were not found to correlate clearly with intelligibility in this study. In general, "Google" was the second most intelligible voice after "Amazon" across all conditions and sometimes appeared even better than "Amazon" (Figures 4 and 5). However, the VSA, between-vowel dispersion, and between-to-within vowel dispersion ratio of "Google" were the smallest among the four voices. The literature on the relationship between VSA and its related measures and intelligibility is mixed. While an expanded VSA was reported in some studies on clear speech, e.g., [10,47], and on Lombard speech compared to normal speech, e.g., [10], a reduced VSA was found in Lombard speech in other studies [48,49]. The current findings therefore lean towards the conclusion that VSA and vowel quality do not have a direct impact on the intelligibility of synthetic speech, despite the plausible speculation that little overlap between sounds and consistent pronunciation may reduce confusion between sounds.
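The dispersion measures referred to here (Table 5) can be illustrated as follows. The definitions in this sketch follow common practice in the vowel-space literature and are an assumption: Euclidean distances in F1-F2 space, between-vowel dispersion as the mean distance of each vowel's centroid from the grand centroid, and within-vowel dispersion as the mean token-to-centroid distance. The study's exact formulas may differ.

```python
import numpy as np

def vowel_dispersions(formants_by_vowel):
    """Given a dict mapping vowel labels to (n_tokens, 2) arrays of
    (F1, F2) values, return between-vowel dispersion, within-vowel
    dispersion, and their ratio. A larger ratio suggests well-separated,
    consistently produced vowel categories."""
    centroids = {v: np.mean(t, axis=0) for v, t in formants_by_vowel.items()}
    grand = np.mean(list(centroids.values()), axis=0)       # grand centroid
    xi_b = np.mean([np.linalg.norm(c - grand) for c in centroids.values()])
    xi_w = np.mean([np.mean(np.linalg.norm(t - centroids[v], axis=1))
                    for v, t in formants_by_vowel.items()])
    return xi_b, xi_w, xi_b / xi_w
```

On this definition, tight token clusters (small ξw) around widely spread centroids (large ξb) maximize the ratio, which matches the intuition that such a vowel system should minimize confusions.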
This study has a few limitations. First, only six acoustic features were analyzed. Several other properties also correlate with intelligibility, including emphasis and prosody [50,51]. Two acoustic measures that can indicate emphasis are the F0 contour and the maximum intensity of the signals [52], as they give listeners strong cues about what to expect in the speech. On the same note, the DWGP model predicted intelligibility well, implying that there could be other acoustic aspects accounted for by the DWGP model but not analyzed in this study. For example, no evidence was found to explain the intelligibility ranking of "Google", which achieved intelligibility comparable to "Amazon" in several conditions but differed considerably in the acoustic measurements analyzed here. Second, only a limited number of synthetic voices were evaluated; including more voices could yield different intelligibility and analytical results. Third, the number of participants in both groups was relatively small, which might lead to weak statistical power in the analyses.

Conclusions
In this study, the intelligibility of synthetic speech generated by three modern commercial synthesizers was evaluated, along with a natural female voice, for both NH and HI listeners in noise conditions. For NH listeners, some synthetic voices were significantly more intelligible than natural speech, especially in the more adverse conditions, showing that synthetic speech has improved tremendously in both naturalness and intelligibility over the last decade. Subsequent acoustic analyses of the voices revealed that synthetic speech with a slower speaking rate and reduced spectral tilt tends to be more intelligible for NH listeners. HI listeners, however, benefited little from those acoustic changes because of their hearing loss at high frequencies and potential cognitive decline with aging. These findings may provide a guideline for future customizable speech synthesis techniques that aim to generate voices tailored to listeners' hearing profiles.

Figure 1. Examples of SSN (top left) and SMN waveforms (bottom left) and their spectra (right). The temporal envelopes of the noise maskers are also displayed in red, offset from the waveforms for better illustration.

Figure 2. Predicted intelligibility scores from DWGP for all four voices in SSN (left) and SMN (right).

Figure 3. Average two-ear HLs (dB) of the eight HI listeners with symmetrical hearing loss at six octave frequencies. The solid circles and shaded areas show the mean levels and the corresponding 95% confidence intervals, respectively.

Figure 4. Mean NH intelligibility of synthetic speech generated by three commercial synthesizers and of natural speech at three SNRs in SSN (left) and SMN (right). Error bars indicate 95% confidence intervals. Numbers above the error bars show the actual CRR readings in percentage for each condition.

Figure 5. Mean HI intelligibility of synthetic speech generated by three commercial synthesizers and of natural speech at three SNRs in SSN (left) and SMN (right). Error bars indicate 95% confidence intervals. Numbers above the error bars show the actual CRR readings in percentage for each condition.

Figure 6. Speaking rate (a), F0 (b), HNR (c), VC ratio (d), and spectral tilt (e) of the natural voice and the three synthetic voices. Error bars show 95% confidence intervals of the means.

Table 3. Significant contrasts for the four-way interaction in the LMM model.

Table 5. Vowel quality in the four voices measured for [a], [i], and [u] using between-vowel dispersion ξb, within-vowel dispersion ξw, and their ratio ξb/ξw.