The perception of German word-final devoicing in natural and synthesized speech

Aleese Block, visiting Oslo from UC Davis, talks about the production and perception of word-final devoicing in German across text-to-speech and naturally produced utterances.

This study explores the production and perception of word-final devoicing in German across text-to-speech (TTS) speech, generated with technology used in common voice-AI “smart” speaker devices, and naturally produced utterances. First, we compared the phonetic realization of word-final devoicing in TTS and naturally produced words. Acoustic analyses reveal that the cues to word-final devoicing differed across the two speech types. Naturally produced words with phonologically voiced codas contained partial voicing, as well as longer vowels than words with voiceless codas. However, these distinctions were not present in TTS speech. Next, to examine the intelligibility consequences of these devoicing patterns across speech types, we had German listeners complete a forced-choice identification task in which they heard the words and categorized the coda consonant. Accuracy was higher for naturally produced than for synthetic speech. Moreover, listeners systematically misidentified voiced codas as voiceless in TTS speech. Overall, this study extends previous literature on speech intelligibility at the intersection of speech synthesis and contrast neutralization. TTS voices tend to neutralize salient phonetic cues present in natural speech. Consequently, listeners are less able to identify phonological distinctions in TTS speech. We also discuss how investigating which cues are more salient in natural speech can benefit synthetic speech generation, making TTS voices not only more natural but also easier to perceive.

This is a hybrid event: the presenter will speak in person in HW536, and the talk will also be streamed on Zoom.

