KTH and Wikipedia develop first crowdsourced speech engine

Published Mar 10, 2016

By 2017, English, Swedish and Arabic speakers will find that Wikipedia is talking their language — literally. The online free encyclopedia is collaborating with KTH Royal Institute of Technology to develop the world's first crowdsourced speech synthesis platform.

Swedish, English and Arabic will be the first languages launched on Wikipedia's synthesised speech platform. (Photo: David Callahan)

The platform will be optimised for Wikipedia but freely available as open source, and readily usable by any site that runs on MediaWiki, the software on which Wikipedia itself is built.
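
To give a feel for what "readily usable by any MediaWiki site" could look like in practice, the sketch below fetches a page's plain text through the standard MediaWiki API (using the TextExtracts extension, available on Wikipedia) and reads it aloud with the open-source espeak-ng engine. This is an illustrative stand-in only: the article does not describe the Wikispeech platform's actual interface, and espeak-ng is not its synthesiser.

```python
# Illustrative only: pull plain-text page content through the standard
# MediaWiki API and voice it with espeak-ng as a stand-in engine.
# The real Wikispeech platform's API is not described in this article.
import subprocess

import requests

API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki site with TextExtracts


def fetch_extract(title: str) -> str:
    """Return the plain-text introduction of a wiki page via the MediaWiki API."""
    params = {
        "action": "query",
        "prop": "extracts",    # requires the TextExtracts extension
        "explaintext": 1,      # strip wiki markup
        "exintro": 1,          # introduction section only
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    return next(iter(pages.values()))["extract"]


def speak(text: str, voice: str = "en") -> None:
    """Synthesise text with the espeak-ng command-line tool."""
    subprocess.run(["espeak-ng", "-v", voice, text], check=True)


if __name__ == "__main__":
    speak(fetch_extract("KTH Royal Institute of Technology"))
```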

Joakim Gustafson, a professor of Speech Technology at KTH, says the project aims to give people with reading difficulties or visual impairments access to Wikipedia and other wikis.

"Initially our focus will be on Swedish language, where we will make use of our own language resources," Gustafson says. "Then we will do a basic English voice, which we expect to be quite good given the large amount of open source linguistic resources. And finally, we will do a rudimentary Arabic voice, that will be more a proof of concept."

An estimated 25 percent of all Wikipedia users — nearly 125 million people per month — need or prefer text in spoken form, according to Wikimedia Sweden.

Like Wikipedia's content, the speech output will be crowdsourced, with users contributing to the continuous development of the synthesiser.
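
One plausible mechanism for such contributions, assumed here purely for illustration, is a shared pronunciation lexicon that users edit much as they edit a wiki page; the synthesiser would consult it before falling back on automatic pronunciation rules. The CrowdLexicon class below is hypothetical, not part of Wikispeech's documented design.

```python
# A minimal sketch (assumed mechanism, not Wikispeech's documented design):
# users contribute pronunciation overrides to a shared lexicon, and the
# synthesiser consults it before applying rule-based pronunciation.
from dataclasses import dataclass, field


@dataclass
class CrowdLexicon:
    # word -> IPA transcription, contributed and reviewed by users
    entries: dict[str, str] = field(default_factory=dict)

    def contribute(self, word: str, ipa: str) -> None:
        """Record a user-supplied pronunciation for a word."""
        self.entries[word.lower()] = ipa

    def lookup(self, word: str) -> str | None:
        """Return the crowd pronunciation, or None to fall back on rules."""
        return self.entries.get(word.lower())


lexicon = CrowdLexicon()
lexicon.contribute("KTH", "ˈkoː ˈteː ˈhoː")  # Swedish letter names
print(lexicon.lookup("kth"))                  # -> 'ˈkoː ˈteː ˈhoː'
```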

Once the English, Swedish and Arabic speech engines are in place, sometime around September 2017, users will be able to help extend synthesised speech to the remaining 280 languages in which Wikipedia is available.

All material produced will be freely licensed and can be used for free by anyone, in line with the rules of Wikimedia Commons.

The Wikispeech pilot project is a collaboration between KTH, the Swedish Post and Telecom Authority (PTS), Wikimedia Sweden and STTS speech technology services. PTS is financing the project.

In addition, KTH will back the effort with a co-project on improved intonation modelling in speech synthesis. This project is supported by KTH's ICT TNG environment (https://www.ict-tng.kth.se) and will be led by Jonas Beskow, Professor of Speech Communication at KTH, and postdoctoral researcher Zofia Malisz.
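
As a rough indication of what intonation modelling involves, the textbook-style sketch below generates a fundamental-frequency (F0) contour from a declining baseline plus Gaussian pitch-accent peaks. It is a generic illustration, not the model being developed in the KTH co-project.

```python
# Generic illustration of intonation modelling: an F0 contour built from
# a declining baseline plus Gaussian accent bumps. Not the KTH project's model.
import math


def f0_contour(duration_s: float, accents: list[float],
               start_hz: float = 120.0, slope_hz_per_s: float = -10.0,
               accent_hz: float = 30.0, width_s: float = 0.15,
               step_s: float = 0.01) -> list[float]:
    """Return F0 samples: a declination line plus Gaussian accent peaks."""
    contour = []
    t = 0.0
    while t <= duration_s:
        f0 = start_hz + slope_hz_per_s * t            # gradual declination
        for a in accents:                             # pitch-accent peaks
            f0 += accent_hz * math.exp(-((t - a) / width_s) ** 2 / 2)
        contour.append(f0)
        t += step_s
    return contour


# Two accented syllables, at 0.4 s and 1.2 s, in a 2-second utterance
samples = f0_contour(2.0, accents=[0.4, 1.2])
```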

David Callahan/Peter Larsson