Predictive Modeling of Turn-Taking in Spoken Dialogue
Computational Approaches for the Analysis of Turn-Taking in Humans and Spoken Dialogue Systems
Time: Fri 2023-12-08, 10:00
Subject area: Computer Science, Human-Computer Interaction, Speech and Music Communication
Doctoral student: Erik Ekstedt, Speech, Music and Hearing (TMH)
Opponent: Professor Roger Moore, Department of Computer Science, University of Sheffield, Sheffield, UK
Supervisor: Professor Gabriel Skantze, Speech, Music and Hearing (TMH)
Turn-taking in spoken dialogue is a complex cooperative process in which participants use verbal and non-verbal cues to coordinate who speaks and who listens, to anticipate speaker transitions, and to produce backchannels (e.g., “mhm”, “uh-huh”) at the right places. This thesis frames turn-taking as the modeling of the voice activity dynamics of dialogue interlocutors, with a focus on predictive modeling of these dynamics using both text- and audio-based deep learning models. Crucially, the models operate incrementally, estimating the activity dynamics across all potential dialogue states and interlocutors throughout a conversation. The aim of these models is to increase the responsiveness of Spoken Dialogue Systems (SDSs) while minimizing interruptions. However, considerable focus is also placed on the analytical capabilities of these models, which can serve as data-driven, model-based tools for analyzing human conversational patterns in general.
This thesis focuses on the development and analysis of two distinct models of turn-taking: TurnGPT, operating in the verbal domain, and the Voice Activity Projection (VAP) model, operating in the acoustic domain. Trained with general prediction objectives, these models offer versatility beyond turn-taking, enabling novel analyses of spoken dialogue. Using attention- and gradient-based techniques, this thesis sheds light on the crucial role of context in estimating speaker transitions within the verbal domain. The potential of incorporating TurnGPT into SDSs is investigated as a way to enhance system responsiveness, employing a sampling-based strategy to predict upcoming speaker transitions from incomplete text, that is, from words yet to be transcribed by the ASR. The VAP model, which predicts the joint voice activity of both dialogue interlocutors, is introduced and adapted to handle stereo-channel audio. The model’s prosodic sensitivity is examined both in targeted utterances and in extended spoken dialogues. This analysis reveals that while intonation is crucial for distinguishing syntactically ambiguous events, it plays a less important role in general turn-taking within long-form dialogues. The VAP model’s analytical capabilities are also highlighted: the model is used to assess the impact of filled pauses and serves as an evaluation tool for conversational TTS, determining whether such systems produce prosodically relevant turn-taking cues.
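To make the notion of "joint voice activity projection" concrete, the sketch below illustrates one way such a prediction target can be discretized. In the published VAP formulation, the near future of each speaker is summarized as a small number of binary activity bins, and the joint pattern of both speakers is mapped to a single discrete class; the bin count, function names, and example patterns here are illustrative assumptions, not code from the thesis.

```python
# Illustrative sketch (not the thesis implementation): discretizing the joint
# future voice activity of two dialogue interlocutors, VAP-style.
# Assumption: 4 binary bins per speaker, giving an 8-bit joint state
# (256 possible classes) that a model could predict a distribution over.

N_BINS = 4  # bins per speaker (illustrative; real bin widths vary over time)

def joint_state(speaker_a, speaker_b):
    """Map two binary activity patterns (one per speaker) to one class index.

    Each element is 1 if the speaker is projected to be active in that
    future bin, 0 otherwise. The bits are concatenated A-then-B and read
    as an unsigned integer.
    """
    bits = list(speaker_a) + list(speaker_b)
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return index

# Example: speaker A keeps talking while speaker B stays silent.
hold_state = joint_state([1, 1, 1, 1], [0, 0, 0, 0])
# Example: A falls silent and B takes the turn (a speaker shift pattern).
shift_state = joint_state([0, 0, 0, 0], [1, 1, 1, 1])
print(hold_state, shift_state)  # 240 15
```

A model predicting a probability distribution over these 256 states implicitly captures turn holds, shifts, overlaps, and backchannel-like short activity, which is what makes the objective useful beyond turn-taking alone.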