Investigation of auditory nerve model based analysis for vocoded speech synthesis

Time: Wed 2020-02-05 15.15

Location: Fantum (Lindstedtsvägen 24, floor 5, room no. 522)

Lecturer: Dr. Sébastien Le Maguer

In recent decades, the quality of speech synthesized by computers has
increased drastically. However, evaluating such systems remains a
challenge as the relevant methodologies haven't evolved for more than a
decade. Subjective evaluation provides a global overview of the quality,
but lacks any detailed feedback. Furthermore, research in objective
evaluation hasn't yet delivered any detailed analysis methodologies.

Inspired by the speech intelligibility and speech quality fields, we
investigate how we can use an Auditory Nerve (AN) model to improve
objective evaluation of speech synthesis systems. To do so, we compare
different configurations of HMM and DNN synthesis using three different
metrics derived from spectrograms, mean-rate neurograms and fine-timing
neurograms. The three metrics are the Euclidean distance, the
root-mean-square error (RMSE) and the
Neurogram Similarity Index Measure (NSIM).
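
As a rough illustration of the two simpler metrics, the sketch below computes the Euclidean distance and the RMSE between a reference and a synthesized time-frequency representation (a spectrogram or a neurogram), each stored as a list of rows of equal shape. The function names and the toy values are assumptions made for this example and are not taken from the talk. NSIM is left out because it depends on windowing and constant choices that the abstract does not specify.

```python
import math


def _squared_errors(ref, syn):
    """Return the element-wise squared differences of two equal-shape matrices."""
    if len(ref) != len(syn) or any(len(r) != len(s) for r, s in zip(ref, syn)):
        raise ValueError("representations must have the same shape")
    return [(a - b) ** 2 for r, s in zip(ref, syn) for a, b in zip(r, s)]


def euclidean_distance(ref, syn):
    """Square root of the summed squared differences over all bins."""
    return math.sqrt(sum(_squared_errors(ref, syn)))


def rmse(ref, syn):
    """Square root of the mean squared difference over all bins."""
    errors = _squared_errors(ref, syn)
    return math.sqrt(sum(errors) / len(errors))


# Toy 2x2 "representations" (hypothetical values): only one bin differs, by 2.
ref = [[1.0, 2.0], [3.0, 4.0]]
syn = [[1.0, 2.0], [3.0, 6.0]]

print(euclidean_distance(ref, syn))  # 2.0
print(rmse(ref, syn))                # 1.0
```

Both measures compare bins one by one, so they are sensitive to small time or frequency shifts. The Euclidean distance also grows with the size of the representation, whereas the RMSE is normalized by the number of bins.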

The results we obtain show that, to evaluate speech synthesis, comparing
mean-rate neurograms using the NSIM metric is an effective alternative
to the comparison of spectrograms. These results are promising as using
neurograms, based on AN models, redefines the objective analysis of
speech synthesis as a speech perception problem. This could be a
step toward new methodologies for understanding how humans
perceive synthesized speech.