One of the most persistent visions in science fiction and research alike is the ability to communicate with machines through speech. Realizing this vision has proved difficult, and decades of research in speech technology, computational linguistics and artificial intelligence have revealed far greater scientific challenges than anticipated. In recent years, however, we have seen robust improvements. Today, speech interfaces in applications such as voice assistants in mobile phones and cars are seeing a steady increase in users, and they are the obvious interaction modality of choice in new fields such as social robotics.
Conversational systems research at KTH Speech, Music and Hearing seeks to make interactions with these systems more fluent and the systems more human-like. For instance, when humans speak to one another, we consider not only what the other person says, but also how it is said (e.g. the prosody), their facial expression, where they are looking, and so on. A conversational system has to be able to interpret and respond to these signals accurately, and to generate them appropriately, or it will likely appear stupid, strange, or both. A number of other skills, often so natural to humans that we rarely think about them, must also be replicated in the computer for human-like communication to take place. We easily conduct conversations with more than one person at a time, we show awareness of our surroundings, and we refer effortlessly to things that exist in the space in which the conversation takes place.
An aspect of conversation that is largely marginalized in traditional spoken dialogue systems is that conversation is a real-time process. Conversations between people are often described as emergent and co-constructed, that is, they are produced continuously by all participants, rather than turn by turn by one participant at a time. At the very least, people start planning what to say before the other person has finished speaking. This is an essential feature for conversational systems as well, as excessive silence after an utterance can easily be interpreted as meaningful (perhaps as hesitation or scepticism) or may cause the listener to lose interest. We therefore study dialogue models that are designed to be real-time. As such models are hard to get right by manual design alone, we explore the use of machine learning to allow them to learn both from human-human conversation and from their own use in conversations with humans.
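To make the incremental principle concrete, the following is a minimal, hypothetical sketch (all class and method names are illustrative, not part of any actual KTH system): the agent consumes partial speech-recognition output word by word and begins drafting a response while the user is still speaking, so that when the end of the turn is detected the reply is already prepared and the silent gap is minimized.

```python
# A minimal sketch of incremental dialogue processing. Names and the
# toy intent rule are assumptions for illustration only.

class IncrementalAgent:
    """Processes partial speech-recognition hypotheses word by word."""

    def __init__(self):
        self.heard = []            # words received so far in this turn
        self.draft_response = None  # response planned mid-utterance

    def on_partial(self, word):
        """Called for each incremental ASR word, before the turn ends."""
        self.heard.append(word)
        # Begin planning as soon as the intent seems clear enough,
        # rather than waiting for end-of-turn silence.
        if "weather" in self.heard and self.draft_response is None:
            self.draft_response = "It is sunny today."

    def on_turn_end(self):
        """Called when the end of the user's turn is detected."""
        # Because planning overlapped with the user's speech, the
        # response can be delivered with minimal gap.
        return self.draft_response or "Sorry, could you repeat that?"

agent = IncrementalAgent()
for w in "what is the weather like".split():
    agent.on_partial(w)
print(agent.on_turn_end())  # response was drafted mid-utterance
```

A real system would replace the toy keyword rule with a learned model of turn-taking and intent, trained, as described above, on human-human conversation; the structural point is only that interpretation and response planning run concurrently with the incoming speech.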