Perceptual debugging of speech synthesis
Time: Thu 2017-03-16 15.00 - 17.00
Lecturer: Gustav Eje Henter, Digital Content and Media Sciences Research Division, NII, Japan
Location: Fantum, Lindstedsvägen 24, 5th floor
Speech synthesis is not yet completely natural. This talk discusses "perceptual debugging": fault-finding techniques that identify which synthesiser design decisions are responsible for perceptual degradations, in order to guide future synthesis research. We cover methods both for dissecting existing synthesisers (for example, isolating the improvements behind the success of DNN-based TTS) and for identifying the most important naturalness bottlenecks in state-of-the-art synthesisers. The second topic is particularly innovative, as it involves listening to hypothetical future synthesisers beyond our current capabilities, whose output we simulate using repeated readings of the same text. We find that several factors (not just the switch to DNNs) are responsible for recent improvements in TTS technology, and that the quality of statistical parametric speech synthesis remains hampered both by our independence assumptions and (ultimately) by our decision to generate the mean speech as output. We round off by showing how the use of repeated speech recordings may be extended to an entire research programme, answering questions previously considered unanswerable.