Probabilistic Speech & Motion Synthesis

Towards More Expressive and Multimodal Generative Models

Time: Fri 2025-09-12 13:00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Video link: https://kth-se.zoom.us/j/69476396694

Language: English

Subject area: Computer Science

Doctoral student: Shivam Mehta, Speech, Music and Hearing (TMH)

Opponent: Dr Robert A. J. Clark, Google UK

Supervisors: Assistant Professor Gustav Eje Henter, Speech, Music and Hearing (TMH); Jonas Beskow, Speech, Music and Hearing (TMH)


Abstract

Human communication is richly multimodal, combining speech with co-speech gestures to convey meaning, intention, and affect. Both modalities are shaped by context and communicative intent, and exhibit substantial variability in timing, prosody, and motion. Accurately generating these behaviors from text presents a fundamental challenge in artificial intelligence. Traditional deterministic systems fall short in capturing this diversity, leading to oversmoothed, repetitive outputs that lack spontaneity. This thesis addresses these limitations by developing a sequence of probabilistic generative models for high-quality, efficient, and expressive synthesis of speech and co-speech gestures from textual input.

We begin by advancing probabilistic text-to-speech (TTS) through the integration of monotonic alignment and duration modeling via neural hidden Markov models (HMMs). These models replace attention mechanisms with a left-to-right HMM whose emissions are parameterized by neural networks, offering robust, data-efficient training with exact likelihood optimization and controllable prosody. Building on this foundation, we introduce OverFlow, a framework that combines neural HMMs with normalizing flows to model the complex, non-Gaussian distribution of speech acoustics. This enables fully probabilistic modeling and sampling of audio features with improved likelihood and naturalness.

To achieve faster yet expressive synthesis, we present Matcha-TTS, a non-autoregressive (NAR) TTS system trained with optimal-transport conditional flow matching (OT-CFM). This model leverages efficient ODE-based sampling and a lightweight convolutional transformer architecture, significantly reducing the number of synthesis steps needed while maintaining high perceptual quality. We further investigate probabilistic duration modeling in fast non-autoregressive TTS models and demonstrate that probabilistic modeling substantially benefits spontaneous speech synthesis, where duration variability is high and deterministic models underperform.

Expanding from unimodal to multimodal generation, we explore the joint synthesis of speech and co-speech gestures. Diff-TTSG introduces a diffusion-based framework for integrated generation using double diffusion decoders, while Match-TTSG improves synthesis speed and coherence by extending OT-CFM to the multimodal domain with a unified decoder. Match-TTSG learns the joint distribution over acoustic and gestural features, enabling synchronized and cross-modally appropriate output from a single generative process. To address data scarcity in multimodal corpora, we propose Fake it to make it, a two-stage strategy in which synthetic data generated by powerful unimodal models is used to pretrain a multimodal generative system, yielding improved downstream performance.

Finally, the thesis transitions to discrete audio modeling and large language models (LLMs). We propose LM-MSN, which combines variational quantization with flow-matching reconstruction to produce low-bitrate discrete audio tokens. This facilitates early fusion of audio and text tokens and enables multimodal LLM training for both audio comprehension and generation.

Together, the contributions of this thesis represent a coherent progression from probabilistic speech synthesis to unified multimodal generation and scalable discrete modeling. By leveraging expressive generative modeling across modalities, we demonstrate how probabilistic modeling can overcome the limitations of deterministic synthesis and move towards more natural, controllable, and expressive communicative AI.
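The "exact likelihood optimization" that the neural HMM systems rely on can be made concrete with the forward algorithm. The following is a minimal illustrative sketch (not code from the thesis): it computes the exact log-likelihood of an observation sequence under a left-to-right, no-skip HMM in log space, where in a neural HMM the emission log-probabilities and self-transition probabilities would be predicted by neural networks conditioned on the input text.

import numpy as np

def log_likelihood(log_b, stay):
    # log_b: (T, N) array; log_b[t, j] = log p(x_t | state j).
    #        In a neural HMM, these come from a neural network.
    # stay:  (N,) self-transition probabilities; the only other allowed
    #        move in a left-to-right, no-skip HMM is state j -> j + 1.
    T, N = log_b.shape
    log_stay = np.log(stay)
    log_adv = np.log1p(-stay)              # log(1 - stay): advance one state
    log_alpha = np.full(N, -np.inf)
    log_alpha[0] = log_b[0, 0]             # must start in the first state
    for t in range(1, T):
        prev = log_alpha
        new = np.full(N, -np.inf)
        new[0] = prev[0] + log_stay[0]
        # Combine "stay in state j" with "advance from state j - 1".
        new[1:] = np.logaddexp(prev[1:] + log_stay[1:],
                               prev[:-1] + log_adv[:-1])
        log_alpha = new + log_b[t]
    return log_alpha[-1]                   # must end in the final state

Maximizing this quantity, with gradients flowing into the networks that predict the emission and transition parameters, is what distinguishes exact-likelihood HMM training from approximate attention-based alignment.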
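For the OT-CFM training behind Matcha-TTS and Match-TTSG, the general recipe (following Lipman et al.'s conditional flow matching with optimal-transport paths; the notation here is the generic formulation, not necessarily the thesis's own, and conditioning on the text encoder output is suppressed) regresses a neural vector field onto the constant velocity of a near-straight conditional path from noise x_0 ~ N(0, I) to a data point x_1:

\[
\phi_t(x_0 \mid x_1) = \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1,
\qquad
u_t(x_0 \mid x_1) = x_1 - (1 - \sigma_{\min})\,x_0,
\]
\[
\mathcal{L}_{\mathrm{CFM}}
= \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x_1 \sim q_{\mathrm{data}},\; x_0 \sim \mathcal{N}(0, I)}
\bigl\lVert v_\theta\bigl(\phi_t(x_0 \mid x_1),\, t\bigr) - u_t(x_0 \mid x_1) \bigr\rVert^2 .
\]

Because these conditional paths are (near-)straight lines, the learned ODE dx/dt = v_\theta(x, t) can be integrated accurately with very few steps at synthesis time, which is the source of the reduction in synthesis steps relative to many-step diffusion sampling.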
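Finally, "early fusion of audio and text tokens" means that discrete audio codes and text tokens share a single vocabulary and a single sequence, so one LLM models both modalities with ordinary next-token prediction. The sketch below is purely hypothetical: the vocabulary sizes, the marker tokens, and the fuse function are invented for illustration and are not taken from LM-MSN.

TEXT_VOCAB = 32000                       # assumed text vocabulary size
N_AUDIO_CODES = 1024                     # assumed audio codebook size
BOA, EOA = TEXT_VOCAB, TEXT_VOCAB + 1    # invented begin/end-of-audio markers

def fuse(text_ids, audio_codes):
    # Shift audio codes past the text vocabulary (and the two markers)
    # so both modalities live in one shared token space.
    assert all(0 <= c < N_AUDIO_CODES for c in audio_codes)
    audio_ids = [TEXT_VOCAB + 2 + c for c in audio_codes]
    return text_ids + [BOA] + audio_ids + [EOA]

An LLM trained on such fused sequences can then continue text given audio (comprehension) or emit audio tokens given text (generation) within one model, which is why low-bitrate tokenization matters: fewer audio tokens per second means shorter fused sequences.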

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-368342