Speech and Language Technologies

At a minimum, core speech technology covers the transcription of speech into words (Automatic Speech Recognition, ASR), the generation of speech from written words (speech synthesis or Text-to-Speech, TTS), the understanding of what the words mean (spoken language understanding, akin to parsing in written language technology), and the translation of concepts into words (language generation). A number of related technologies, such as speaker verification and voice activity detection, are nearly as fundamental.

Each of these core technologies is a research area within computer science in its own right. When they are studied as stand-alone fields, the main goal is to find the most efficient algorithms under strictly controlled conditions.

At KTH Speech, Music and Hearing, our focus on the human in the speech technology loop has led us to take the road less travelled when it comes to these core technologies, and to investigate their use in speech research and in speech technology applications. In this context, the fields are much less orderly, and their development and evaluation become a multidisciplinary effort. ASR that is to be used in human-like dialogues may not need to be 100% accurate (human listeners are not), but it cannot require a full utterance to be spoken before it delivers a result: recognition has to happen incrementally as the speech is produced. Nor is it enough to know which words are spoken in order to understand human speech. We must also know how they are spoken and in what context, information that is usually ignored in ASR research. To take another typical example, speech synthesis that is to be used in interaction cannot produce monolithic utterances that are played from beginning to end regardless of what happens in the meantime, nor can it produce utterances that sound identical every time they contain the same string of words. Instead, TTS for conversation needs to be able to halt and restart, to hesitate, to speak softer or louder dynamically as the acoustic environment in the room changes, and so on. Furthermore, since we focus on situated, face-to-face interaction, understanding also takes visual input into account (using motion capture and computer vision), and output generation includes the generation of lip movements, facial gestures, communicative gestures and body postures.

In general, KTH Speech, Music and Hearing aims to develop the core speech technologies further so that they can handle conversational, naturally occurring real-world speech and be useful in real-world situations and real applications.
