Found speech and humans in the loop

Ways to gain insight into large quantities of speech

Time: Fri 2022-03-18 14.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Video link: https://kth-se.zoom.us/j/62813774919

Language: English

Subject area: Speech and Music Communication

Doctoral student: Per Fallgren, Speech Communication

Opponent: Associate Professor Fred Cummins, University College Dublin, Belfield, Dublin, Ireland

Supervisor: Docent/Associate Professor Jens Edlund, Speech, Music and Hearing (TMH)

QC 20220222

Abstract

Found data, that is, data used for something other than the purpose for which it was originally collected, holds great value in many regards. It typically has high ecological validity and strong cultural worth, and it is available in significant quantities. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained to solve well-defined, narrow problems and perform inadequately in tasks of a more general nature, such as providing a high-level description of the contents of a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of keeping a human in the loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.
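
The general idea described above (cut the audio into short segments, extract features per segment, and project them to a low-dimensional space for browsing) can be illustrated with a minimal sketch. This is not Edyson's implementation: the function names are hypothetical, the log-magnitude spectrum stands in for whatever feature extraction is actually used, PCA stands in for the dimensionality reduction algorithm, and the input is a synthetic signal rather than found speech.

```python
import numpy as np

def disassemble(signal, sample_rate, segment_seconds=0.1):
    """Cut a 1-D signal into equal-length, non-overlapping segments.

    After this step each row is treated as an independent item;
    its position in time is no longer used by the later steps.
    """
    seg_len = int(sample_rate * segment_seconds)
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

def spectral_features(segments):
    """Log-magnitude spectrum per segment (assumed stand-in feature)."""
    window = np.hanning(segments.shape[1])
    spectrum = np.abs(np.fft.rfft(segments * window, axis=1))
    return np.log1p(spectrum)

def project_2d(features):
    """PCA via SVD: project onto the first two principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic stand-in for a recording: 2 s at 200 Hz, then 2 s at 2000 Hz.
sample_rate = 16000
t = np.arange(2 * sample_rate) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * 200 * t),
                         np.sin(2 * np.pi * 2000 * t)])

segments = disassemble(signal, sample_rate)
points = project_2d(spectral_features(segments))  # one 2-D point per segment
```

In such a sketch, segments that sound alike land close together in the 2-D plane regardless of when they occurred, so a listener could in principle sample one region of the plot to get an impression of one kind of content.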

Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech and language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are corroborated here.

The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-309031