CAPTivating - Comparative Analysis of Public Speaking with Text-to-speech Synthesis

Captivating an audience means attracting and holding the listeners' attention by being very interesting, exciting or pleasant. The proposed method uses comparative perceptual experiments with spontaneous speech synthesis to systematically vary individual speech features and measure their direct and combined perceptual impact.

In the project we will select ecologically valid text materials, such as publicly available podcasts and collections of formal speeches. We will then use our spontaneous neural TTS to synthesise the selected materials according to particular hypotheses on how different aspects of speech delivery influence listeners’ perception of the speaker. This will involve different breathing strategies, intonational variability, changing speaking rate, inserting a variety of fillers, discourse markers and other spontaneous speech phenomena. Subsequently, the stimuli will be re-created with alternative speaker characteristics and also voice quality characteristics, such as vocal tension and creak. We will conduct our listening tests using crowd-sourcing through Prolific Academic and WebMushra. We will also conduct on-site experiments involving multimodal recordings including speech, video, depth and facial action units, heat video, gaze and heart rate, to gain a deeper understanding of the listener’s impression and affective state.

This project is part of our efforts in recent years to develop spontaneous conversational speech synthesis. For examples, please visit .



Riksbankens Jubileumsfond


2020 → 2025