1. Introduction
Today, people interact with synthesized speech in a variety of situations. For instance, the prevalence of voice-activated artificially intelligent devices, which use text-to-speech (TTS) generation to communicate with users, means that comprehending synthetic speech can be an everyday event for many people. Intelligibility of TTS speech is both a practical and theoretical issue. From the practical perspective, as TTS methods evolve, understanding the ways in which comprehension of synthetic speech deviates from that of natural speech can contribute to speech technology advancements (e.g.,
Duffy and Pisoni 1992). Theoretically, comparing how speech intelligibility differs across naturally produced and TTS voices can inform our understanding of what properties of the signal allow listeners to better understand a speaker’s intended message.
The current study asks whether the production and perception of phonological variation differ across naturally produced and TTS speech. We focus on word-final devoicing, a cross-linguistically common phonological process observed in languages such as German and Polish that commonly leads to neutralization of the contrast between voiced and voiceless obstruents in word-final coda position (
Dinnsen and Charles-Luce 1984;
Port and O’Dell 1985;
Winter and Röttger 2011). However, speakers often produce secondary phonetic features that differentiate word-final voiced and voiceless stops, such as differences in preceding vowel duration, consonant closure duration, and burst duration across words with underlying voiced and voiceless obstruent codas (
Fourakis and Iverson 1984;
Charles-Luce 1985). Thus, in natural speech, speakers appear to show incomplete neutralization of a phonological process that might lead to contrast reduction. Moreover, listeners are highly sensitive to these secondary acoustic properties in identifying the underlying phonological contrast: for instance, German listeners are able to accurately categorize word-final voiced and voiceless obstruents in naturally produced speech (
Janker and Piroth 1999).
The current study asks two questions. First, how does the acoustic realization of word-final devoicing differ across naturally produced and (concatenative) TTS speech? Specifically, we ask whether the secondary phonetic cues to word-final obstruent voicing observed in naturally produced speech are present in TTS speech of the kind used in common household smart speakers, such as Amazon’s Echo. We compare several acoustic features in the realization of word-final devoicing in German across TTS and naturally produced word productions in order to quantify how they differ. Second, we explore the implications of any phonetic differences across natural and TTS speech for intelligibility. Specifically, is there a difference in listeners’ ability to identify the contrast between word-final voiced and voiceless codas across natural and TTS speech? We compare native listeners’ categorizations of words with final voiced codas in German across natural and TTS productions. Understanding how phonological variation differs across natural and synthetic speech can further delineate the gap that remains in generating the most naturalistic TTS. Moreover, from a practical perspective, identifying how the properties of synthetic speech reduce intelligibility, relative to natural speech, can be used to improve TTS in the future.
1.1. German Word-Final Devoicing in Production
Phonological neutralization occurs when a phonological distinction disappears in a specific context in a language. There are many different phonological processes that can lead to neutralization, resulting in a loss of lexical contrast in certain contexts. A cross-linguistically well-documented phonological process that leads to contrast neutralization is the devoicing of historically voiced obstruents in word-final position. This process is exemplified, and well-studied, in German; for instance, the members of the minimal pair Bund (“league”) and bunt (“bright”) are both produced as [bunt]. In other morphological alternations, these underlying voiced stops surface (e.g., Bünde is pronounced [by:ndə], retaining full voicing), indicating that German speakers have paradigmatic evidence for the phonological form of the neutralized words.
However, there is much prior work investigating whether this process is categorical, resulting in full neutralization of minimal pair contrasts, or more gradient, such that phonetic cues to the underlying voiced-voiceless distinction are still present in the speech signal. Numerous studies have shown that neutralization of word-final voiced stops is indeed ‘incomplete’ in German production (
Dinnsen and Charles-Luce 1984;
Port and O’Dell 1985;
Winter and Röttger 2011;
Kharlamov 2012). For example,
Port and O’Dell (
1985) examine word-final neutralization of German stop voicing and its realization in natural speech as a “semicontrast” (p. 455). They investigate the production of word-final /b/-/p/, /d/-/t/, and /g/-/k/ in real words and find that word final voiceless stops and word final devoiced stops are not acoustically identical. They observe that devoiced stops showed (1) longer duration of the preceding vowel, (2) longer consonant closure duration, (3) shorter burst duration and (4) some voicing in the coda stop closure, relative to underlyingly voiceless stops. Similarly,
Fourakis and Iverson (
1984) also found a significant difference in vowel duration and consonant duration, with both being longer in devoiced stops in comparison to voiceless stops.
Charles-Luce (
1985) also found preceding vowels in devoiced stop contexts to be significantly longer than those in voiceless contexts. Overall, it appears that there are multiple phonetic differences that distinguish underlying voiced and voiceless stops in German even after phonological devoicing, with preceding vowel duration appearing to be the most consistent acoustic difference.
Thus, prior work has established that the phonological devoicing of stops in German is ‘incomplete’. Incomplete neutralization of word-final devoicing has also been found cross-linguistically. For example,
Warner et al. (
2004) found that Dutch, a language closely related to German, shows similar sub-phonemic differences in duration across multiple dimensions in production (i.e., vowel duration, burst). Additionally for Dutch,
Ernestus and Baayen (
2006) have found incomplete neutralization to bear some functionality as a cue to past-tense formation. In a study on incomplete neutralization in Polish,
Slowiaczek and Dinnsen (
1985) found vowel duration to be longer in vowels preceding a devoiced stop. Devoiced stops show a longer voicing into closure in Catalan (
Charles-Luce and Dinnsen 1987). This suggests that the features of incomplete neutralization found in German are also a common feature of natural speech in other languages.
The phonological and phonetic variation found in natural language potentially presents a challenge to generating intelligible and naturalistic text-to-speech (TTS) utterances. Though modern speech synthesis now generates highly naturalistic speech, there are still many differences in the acoustic properties of speech generated using TTS and that produced naturally by people. In recent years, people are interacting with digital devices more and more on a daily basis (
Ammari et al. 2019), and engineers have worked to improve synthesis techniques to generate speech that more closely resembles naturally produced speech. Because the methods used in TTS synthesis are rapidly evolving (
van den Oord et al. 2016), the need to explore how the subphonemic phonological variations are represented in TTS-generated speech has grown. Understanding the extent to which fine phonetic detail is both present in speech production and exploited by listeners in speech perception is an important aspect of speech science (e.g.,
Hawkins 2003). The question of how such fine details are produced and perceived in TTS is an important aspect of addressing this area. Our first research question is whether concatenative TTS contains the same subphonemic cues for incomplete neutralization in German that are observed in naturally produced speech.
1.2. The Perception of Word-Final Stops in German
The presence of subphonemic cues to the voicing contrast in word final position in German, and other languages, suggests that the phonological process does not result in complete neutralization of the lexical contrast. Indeed, there is much work examining whether listeners can still identify the underlying phonological contrasts from the presence of distinct phonetic cues. For example,
Port et al. (
1981), and also
Port and O’Dell (
1985), found that listeners are able to identify the voicing category of word final stops above chance for real German words.
Kleber et al. (
2010) presented German listeners with stimuli manipulated along continua of vowel duration and stop closure duration, with values ranging from those of word-final voiced stops to those of voiceless stops. They found that while listeners can distinguish categorically between the continuum endpoints, their judgments shift gradiently as the phonetic differences change; this indicates that listeners show gradient sensitivity to subphonemic cues.
As mentioned earlier, an open question is how phonological variation should be implemented and perceived in TTS. The goal of synthetic speech is to make natural and intelligible speech. Thus, having TTS that contains phonetic variation across phonological conditions similar to what is found in natural speech might achieve that goal. Prior work has found that the addition of fine phonetic details that are present in naturally produced speech improves perception when present in synthetic speech. For example, listeners have been shown to better identify synthesized speech segments if they contained the contextually appropriate coarticulation (i.e., F2 lowering before /r/ or /z/) (
Hawkins and Slater 1994). This demonstrates the critical role of context-specific variation, a ubiquitous feature of speech, in supporting successful phoneme identification in synthetic speech, similar to what is found in naturally produced speech. In the current study, we extend this previous research on how subphonemic variation in synthetic speech influences listener perception to word-final devoicing in German.
Our second research question is how subphonemic cues to word-final devoicing in naturally produced speech vs. TTS influence listener perception. While prior studies have investigated the perception of word-final devoicing in German, no prior work, to our knowledge, has investigated this question in synthesized speech. We are specifically interested in comparing natural speech and TTS, as TTS may contain reduced versions of the acoustic cues present in natural speech.
Why might the perception of incomplete neutralization differ between TTS and naturally produced speech? One possibility is that the degree of neutralization may differ in synthetic speech; the secondary cues to obstruent voicing contrasts that remain in naturally produced speech may be present to a smaller degree or not at all in TTS-generated speech, or different acoustic cues might be emphasized in comparison to natural speech. For example, if the system incorporates only the rule of word-final devoicing (i.e., concatenating a phonological /t/ without secondary cues), it may not have the ability to generate acoustic differences between [at] and [ad̥]. Alternatively, the system might prioritize only one acoustic cue, such as vowel duration, but ignore others such as closure duration, or f0. This difference on a fine acoustic level would lead to a difference in listener accuracy in identifying neutralized word-final obstruents for TTS speech. It has been shown that fine acoustic details drive differences in perceptual patterns; for example,
Zellou et al. (
2021) found that fine-tuned acoustic differences in TTS methods lead to distinct patterns in listeners’ perception for coarticulation rather than perceived roboticity.
Another possibility is that listeners’ perception of secondary cues remaining after incomplete neutralization is influenced by other aspects of synthesized speech outside of phonetic differences in contrast-specific acoustic patterns. The concatenative nature of some types of TTS can cause noticeable mismatches in f0 and spectral properties at phoneme boundaries (
Ávila et al. 2018); this can cause listeners to perceive speech as more “telegraphic”, and they may not look to adjacent segments for clues to an obstruent’s voicing. In both of these scenarios, there is a possibility that TTS-generated speech is not identified with the same ease or accuracy as naturally produced speech, even when secondary phonetic cues to the contrast, in this case preceding vowel duration, are neutralized.
1.3. Current Study
The current study aims to: (1) compare the phonetic realization of word-final voiceless and devoiced German stops in naturally produced and TTS speech and (2) investigate the phonological categorization of word-final devoiced and voiceless codas in naturally produced and synthesized speech.
Experiment 1 is a phonetic analysis of word-final devoiced and voiceless stops in German across naturally produced and TTS speech, generated from two widely used commercial TTS systems (Apple and Amazon Polly). We measure consonant closure duration and duration of the preceding vowel, the subphonemic features commonly found to vary across words containing phonologically voiced and voiceless codas. In Experiment 2, we employ a phoneme identification task to test listeners’ ability to correctly identify intended voiced and voiceless codas in naturally produced and TTS speech. We test German listeners’ phoneme identifications on two types of stimuli: first, listeners perform the task on unaltered productions, allowing us to assess their phoneme identification performance given the existing variation across naturally produced and synthetic speech; second, listeners also make phoneme identifications on items that have been modified to contain neutralized vowel duration cues. The change in listener performance between the two stimulus types indicates the degree to which listeners rely on durational differences in the vowel to identify the voicing category of word-final consonants in German across speech types.
3. Experiment 2: Perception of Word-Final Voicing
Experiment 1 revealed that the cues to word-final coda voicing in German were distinct across natural and synthetic speech: specifically, natural speech contained distinct vowel duration patterns across words with phonologically voiced and voiceless codas while TTS speech did not. The naturally produced words also contained only partial devoicing, whereas the TTS words were fully devoiced. Given these differences across naturally produced and TTS lexical items, we designed Experiment 2 to ask whether listeners use these distinctions in naturally produced speech to categorize word-final voiced and voiceless codas. If so, we also ask whether this will cause differences in overall identifiability for voiced codas between the speaker types. Prior work has found that listeners are sensitive to the preceding vowel duration distinctions across phonologically voiced and voiceless codas in Germanic languages (
Port and O’Dell 1985;
Warner et al. 2004). Thus, we predict that listeners will be more accurate in categorizing coda voicing in naturally produced speech than in TTS speech, since the former makes use of this secondary cue.
In a second part of the experiment, we will neutralize vowel duration differences within each pair to see if listeners can use cues on the coda consonant itself in identification. Consonant closure durations of voiced consonants have been found to be reliably longer than those of voiceless consonants (
Stathopoulos and Weismer 1983), but a distinction in this duration was maintained only in TTS speech. While lacking this durational difference, natural speech did contain partial voicing during consonant closure. Therefore, we predict that listeners will still be able to reliably identify the phonological voicing status of coda consonants in natural speech.
3.1. Stimuli
The stimuli consisted of the 24 words produced by the four talkers from Experiment 1. Stimuli were normalized for intensity (60 dB) and naturally produced tokens were down-sampled from 44.1 kHz to 22.05 kHz to match the TTS voices. One set of stimuli remained as they were recorded (apart from intensity and sampling frequency normalization), and one set was manipulated to create an altered stimulus set. For the altered stimuli, vowels were extracted from their original utterances. Vowels were then normalized for pitch (20% decrease linearly across the vowel) and duration (average within-speaker and within-pair duration) using the VocalToolkit plug-in (
Corretge 2012) for Praat (version 6.1.34;
Boersma and Weenink 2020). Normalization of these secondary cues allows us to test if listeners are able to use acoustic cues remaining on the devoiced obstruent to correctly identify it as being underlyingly voiced, in line with previous cue weighting research (
Clayards 2018). After the vowels were normalized, they were spliced back into their CVC contexts.
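The two normalization targets described above can be illustrated with a minimal sketch. It is not the VocalToolkit procedure itself: the function names and numeric values are hypothetical, and it assumes that "20% decrease linearly across the vowel" means f0 falls in a straight line from its onset value to 80% of that value at vowel offset.

```python
from statistics import mean


def pair_target_duration(voiced_ms: float, voiceless_ms: float) -> float:
    """Shared vowel duration for one speaker's minimal pair (mean of both members)."""
    return mean([voiced_ms, voiceless_ms])


def linear_pitch_contour(start_hz: float, n_points: int, drop: float = 0.20) -> list[float]:
    """f0 values falling linearly from start_hz to start_hz * (1 - drop).

    Assumption: the 20% decrease is measured from vowel onset to vowel offset.
    """
    if n_points < 2:
        raise ValueError("need at least two points to define a contour")
    end_hz = start_hz * (1.0 - drop)
    step = (end_hz - start_hz) / (n_points - 1)
    return [start_hz + i * step for i in range(n_points)]


# Hypothetical values for one speaker's minimal pair (not taken from the study).
target_ms = pair_target_duration(160.0, 90.0)   # both vowels are set to this duration
contour_hz = linear_pitch_contour(200.0, 5)     # five evenly spaced f0 targets
```

Setting both members of a pair to the same duration and the same f0 shape removes the vocalic cue while keeping each speaker's overall tempo and pitch range roughly intact.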
3.2. Participants and Procedure
Fifty-six listeners (47 female, 9 male; mean age = 33.5 years, SD = 11.1) were recruited to participate remotely. All reported being native speakers of German, and none reported a history of hearing impairment. Before listeners began the experimental blocks, they were presented with a sentence “Ich habe einen Tisch” (“I have a table”) normalized to 60 dB and were instructed to adjust their volume to a comfortable level and leave it there for the duration of the task. Stimuli were presented in two experimental blocks. In Block 1, trials contained the unaltered stimulus items containing original vowel duration and pitch contour. In Block 2, listeners were presented with the altered stimulus items which contained vowels normalized within-speaker and -pair for vowel duration and pitch; pitch was normalized to rule out potential effects it might have on perceived duration (
Brigner 1988). On each trial, listeners were presented with a stimulus item and asked to identify the final segment in the word they heard by selecting it from two options; options were always orthographic depictions of the voicing contrast in German (e.g., “d” and “t” for pad). Order of voiced and voiceless options was randomized across trials. In addition to the experimental trials, a listening comprehension question was also included in between the blocks. For the listening comprehension question, listeners heard a sentence (“Wo ist die Katze”) and were asked to indicate what they heard from a list of three options (“Wo ist die Katze”, “Wo ist die Kuh”, and “Wo ist die Maus”; the order of presentation was randomized). Only participants who correctly answered this question had their data included in the analysis (all participants correctly answered this question).
3.3. Results
Responses were coded for whether listeners identified the underlying coda voicing (=1) or not (=0). Responses were analyzed with a mixed-effects logistic regression model using the glmer() function in the lme4 package in R (version 1.1-26;
Bates et al. 2015). Fixed effects included Speaker Type (TTS, Naturally produced), Coda Voicing (Voiced, Voiceless), Log Word Duration, and Trial Type (Original Duration, Duration-Neutralized). We also included all the possible two- and three-way interactions between Speaker Type, Coda Voicing, and Trial Type. Random effects included by-Listener random intercepts and by-Listener random slopes for Speaker Type and Trial Type (more complex random-effects structures led to convergence failures). To explore significant interactions, Tukey’s HSD pairwise comparisons were performed within the model, using the emmeans() function in the emmeans R package (version 1.5.3;
Lenth et al. 2021).
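As a descriptive companion to the model, the response coding and the per-condition proportions can be sketched as follows. This is a hedged illustration rather than the analysis pipeline: the field names and the four example trials are hypothetical, and the snippet only computes raw cell proportions, not the mixed-effects estimates.

```python
from collections import defaultdict


def code_response(coda_voicing: str, response: str) -> int:
    """1 if the listener chose the underlying coda voicing, else 0."""
    return 1 if response == coda_voicing else 0


def proportion_by_cell(trials: list[dict]) -> dict[tuple, float]:
    """Proportion of underlying-voicing responses per Speaker Type x Coda Voicing x Trial Type."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [hits, trial count]
    for t in trials:
        cell = (t["speaker_type"], t["coda_voicing"], t["trial_type"])
        totals[cell][0] += code_response(t["coda_voicing"], t["response"])
        totals[cell][1] += 1
    return {cell: hits / n for cell, (hits, n) in totals.items()}


# Hypothetical trials (field names and values are illustrative only).
example_trials = [
    {"speaker_type": "natural", "coda_voicing": "voiced", "trial_type": "unaltered", "response": "voiced"},
    {"speaker_type": "natural", "coda_voicing": "voiced", "trial_type": "unaltered", "response": "voiceless"},
    {"speaker_type": "tts", "coda_voicing": "voiced", "trial_type": "unaltered", "response": "voiceless"},
    {"speaker_type": "tts", "coda_voicing": "voiceless", "trial_type": "unaltered", "response": "voiceless"},
]
cell_proportions = proportion_by_cell(example_trials)
```

Cell proportions of this kind correspond to the percentages reported alongside the model; the regression additionally accounts for listener-level variation and word duration.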
Table 4 provides the summary statistics from the model.
Figure 4 displays participants’ aggregated responses to coda voicing identification for TTS-generated and natural speech across Coda Voicing and Trial Types. First, the model computed a significant main effect for Speaker Type, where listeners responded less often with the underlying coda voicing for TTS-generated (49%) than for naturally produced (62%) lexical items, on average. Furthermore, there was a main effect for Coda Voicing: listeners responded with the underlying coda voicing less often for voiced codas (41%) than for voiceless codas (71%). The model also revealed a main effect of Trial Type, wherein listeners responded with the underlying coda voicing more often in unaltered lexical items containing original vowel durations (59%) than in duration-neutralized tokens (51%). Lastly, there was a main effect of Word Duration, where listeners responded with the underlying coda voicing more often when the words were longer overall.
The model also computed a significant interaction between Speaker Type and Trial Type. Tukey’s HSD pairwise comparisons revealed that listeners responded with the underlying coda voicing more in the unaltered trials than in the altered trials for the naturally produced speech (z = 10.86, p < 0.01), but there was no difference between Trial Types for TTS-generated speech (z = −0.18, p = 0.86). Listeners responded with the underlying coda voicing more for the natural speech in both the unaltered trials (z = 12.45, p < 0.01) and altered trials (z = 3.05, p = 0.002), but the difference was smaller in the altered trials.
The model also computed a significant interaction between Speaker Type and Coda Voicing. Listeners responded with the underlying coda voicing more often for voiceless codas than for voiced codas in both naturally produced (z = −11.70, p < 0.01) and TTS speech (z = −31.33, p < 0.01). While listeners responded with the underlying voicing more often for voiced codas in naturally produced than in TTS speech (z = 17.43, p < 0.01), there was no difference in identification of voiceless codas between Speaker Types (z = −1.36, p = 0.17). The interaction between Coda Voicing and Trial Type was also significant; listeners responded less often with the underlying coda voicing for voiced codas than for voiceless codas in both unaltered (z = −23.34, p < 0.01) and altered trials (z = −20.09, p < 0.01).
Finally, the model computed a significant three-way interaction between Speaker Type, Trial Type, and Coda Voicing. This interaction is illustrated in Figure 4. For naturally produced speech (left panel), while listeners responded more often with the underlying coda voicing for voiceless codas than for voiced codas in both unaltered (z = −11.41, p < 0.01) and altered trials (z = −22.67, p < 0.01), the difference between Coda Voicing was larger in unaltered trials than in altered trials. Within TTS-generated speech (right panel), listeners responded more with the underlying coda voicing for voiceless codas in both unaltered (z = −21.67, p < 0.01) and altered trials (z = −22.67, p < 0.01), and there was no difference for either Coda Voicing between Trial Types.
4. General Discussion
The current study was designed to investigate the perception of word-final obstruent devoicing in German in concatenative TTS-generated and naturally produced speech. We first conducted an acoustic analysis of words produced by two female native speakers of German and generated by two female German TTS voices, to investigate the extent of neutralization in these voices. We found distinct acoustic realizations of incomplete neutralization between voice types. For one, while speakers produced partial devoicing of underlying voiced stops in natural productions, TTS words contained full devoicing.
The realization of secondary acoustic cues also varied across voice types. In naturally produced speech, there was a significant difference in duration between vowels before voiced (164 ms) and voiceless (85 ms) stops, which is consistent with previous descriptions of word-final devoicing in German as an example of incomplete neutralization (
Port and O’Dell 1985;
Charles-Luce 1985). In contrast, in TTS-generated speech, there was no difference in vowel length (87 ms before voiced; 85 ms before voiceless). Given that salient secondary acoustic cues can contribute to the robustness of a contrast for listeners (
Stevens and Keyser 1989), we can assume that a larger difference in the preceding vowel’s duration is helpful for listeners.
Consonant closure duration, however, was distinct across voiced and voiceless codas within TTS speech, but not in naturally produced speech. However, given that the stops are completely devoiced, this is perhaps a less robust cue to word-final voicing in German. While the difference in consonant closure duration may still be informative and useful for listeners, it is not present in all stimuli and may therefore still pose issues for identification of voiced coda obstruents. From our stimuli, it appears that only consonant closure duration in alveolar consonants is available as a distinction in TTS voices. This highlights a fundamental issue in the production of word-final devoicing in TTS speech: salient and informative secondary acoustic cues are being excluded, leading to different phonetic realizations in TTS speech from those in natural speech. Taken together, the production data suggest that while neutralization of word-final stop voicing is incomplete in naturally produced German, it seems to be neutralized in TTS-generated speech, potentially causing difficulties for listeners interacting with devices.
Experiment 2 examined the perceptual consequences of the word-final devoicing patterns across speech types. It consisted of a two-part forced identification task in which listeners heard the words and made coda consonant categorizations. Participants were presented with two types of trials: unaltered and altered. In the unaltered trials, participants were played the original productions of the wordlists with only intensity and sampling rate normalized. The rate at which listeners identified the underlying coda voicing was higher overall for the naturally produced speech than for the TTS voices for both voiced and voiceless codas. This is in line with our predictions given that the naturally produced speech offered acoustic-phonetic distinctions along two dimensions while TTS speech offered only one. For natural voices, listeners identified the underlying voicing of voiced codas at above-chance levels but did so only around 30% of the time for TTS voices; in the TTS voices, listeners were reliably misidentifying voiced codas as voiceless rather than simply being unsure. This highlights the overall importance of duration of the preceding vowel as an enhancing cue to contrastive stop voicing as well as a possible overall bias toward identifying codas as voiceless.
In altered trials, vowel duration and pitch were normalized to remove vocalic information as an enhancing cue. Here, we found that the rate of identification of underlyingly voiced codas dropped to chance level in the natural voices and stayed around 30% for the TTS voices, showing that listeners were no longer able to identify voicing in these positions. For the natural voices, identification of underlying coda voicing for both voiced and voiceless codas dropped significantly between unaltered and altered trials, showing that neutralizing the secondary cue of vowel duration was particularly detrimental to perception of coda voicing. Meanwhile, there was no difference in response across trial types for TTS voices, suggesting that the preceding vowel was truly uninformative for listeners before manipulation.
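The distinction between responding at chance and reliably choosing the wrong category can be made concrete with an exact binomial test against 50%. The sketch below is an illustration only: the counts are hypothetical, and the test treats trials as independent, which ignores the by-listener structure handled by the mixed-effects model.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under the null success probability p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= threshold))


# Hypothetical counts: 30 underlying-voicing responses out of 100 trials.
p_below_chance = binom_two_sided_p(30, 100)
# Hypothetical counts: 50 out of 100, i.e., exactly at chance.
p_at_chance = binom_two_sided_p(50, 100)
```

A rate near 30% that differs reliably from 50% points to a systematic bias toward "voiceless" responses, whereas a rate indistinguishable from 50% is what guessing would produce.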
There were a number of limitations to the current study that offer possible avenues for future research. First, our study had a relatively limited scope, looking only at monosyllabic non-words in German concatenative TTS voices. The patterns found in these voices might differ from those generated with neural methods given the distinct methods of voice synthesis. Furthermore, our study utilized voices generated by only two systems: Apple and Amazon. Comparing across more TTS voices and systems is a ripe area for future work. Independently developed speech synthesis systems could produce slightly different voice patterns, and therefore, it would be advantageous to include voices from a wider range of systems in future investigations.
Furthermore, the relatively low overall identification accuracy for voiced codas in the original human trials found here (60%) and in other studies (Port et al. (1981) found 60% accuracy in their listening test) raises the question of whether cues found in incomplete neutralization are indeed useful in speech perception, although this number is still high enough to argue for incomplete rather than full neutralization. It has been argued that incomplete neutralization bears no functional relevance in German, as most lexical items undergoing it would never occur in the same contexts (
Röttger et al. 2011). Of course, the results shown here could be an artifact of the experimental design, forcing participants to choose between two options of voicing. However, they do show that listeners are able to exploit fine phonetic details when faced with the task of disambiguating stops in a neutralizing context.
Overall, this study extends previous literature on speech perception at the intersection of contrast neutralization and synthesized speech. Our data show that concatenative TTS voices tend to neutralize salient phonetic cues present in natural speech that are used by listeners to disambiguate phonemic contrasts in the absence of other contextual information. We also show that investigating which cues are more salient in the perception of natural speech can be beneficial in the process of deciding how to model fine phonetic detail in the architecture of TTS voices to help make them not only more natural but also easier to perceive, which could be especially useful in noisy environments and for listeners with hearing impairments.