Synthesising prosody with insufficient context
Time: Fri 2021-02-19 15.15
Lecturer: Zack Hodari
Prosody is a key component in human communication: it conveys humour, emotion, personality, and connotation, as well as simpler concepts like emphasis, contrast, and focus. However, state-of-the-art text-to-speech (TTS) is lacking in prosody, especially for long-form content. While individual sentences can be delivered in a natural manner, the prosody in TTS does not convey additional information. This is due to a lack of contextual information, which humans normally use when planning prosody. Without this contextual information, prosody appears to be unwanted variation, just like background noise. Unfortunately, prosodic context is ill-defined, very broad, low-resource, and expensive to collect. Current approaches that synthesise prosody for isolated sentences (i.e. sentences with no context) are inherently limited. Additionally, they produce average prosody: a boring and flat delivery.
My thesis focuses on capturing and controlling the typically unaccounted-for prosodic variation. I demonstrate that doing so resolves average prosody. I explore interpretable representations for controlling prosody, which allow prosody to be improved even with insufficient context. Finally, I present a state-of-the-art method using additional context information to control prosody. Incorporating context works towards the goal of achieving appropriate prosody, as opposed to focussing on naturalness exclusively.
Zack Hodari is a PhD candidate supervised by Professor Simon King at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. He obtained an MSc by research from the University of Edinburgh, working on emotion recognition and emotive speech synthesis. His PhD research focuses on speech synthesis and prosody, specifically on producing multiple prosodic renditions of individual sentences.