Publications by Gustav Henter

Peer reviewed

Articles

[1]

T. Kucherenko et al., "Evaluating Gesture Generation in a Large-scale Open Challenge : The GENEA Challenge 2022," ACM Transactions on Graphics, vol. 43, no. 3, 2024.

[2]

P. Wolfert, G. E. Henter and T. Belpaeme, "Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour," Applied Sciences, vol. 14, no. 4, 2024.

[3]

S. Nyatsanga et al., "A Comprehensive Review of Data-Driven Co-Speech Gesture Generation," Computer graphics forum (Print), vol. 42, no. 2, pp. 569-596, 2023.

[4]

S. Alexanderson et al., "Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models," ACM Transactions on Graphics, vol. 42, no. 4, 2023.

[5]

J. G. De Gooijer, G. E. Henter and A. Yuan, "Kernel-based hidden Markov conditional densities," Computational Statistics & Data Analysis, vol. 169, 2022.

[6]

T. Kucherenko et al., "Moving Fast and Slow : Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation," International Journal of Human-Computer Interaction, vol. 37, no. 14, pp. 1300-1316, 2021.

[7]

P. Jonell et al., "Multimodal Capture of Patient Behaviour for Improved Detection of Early Dementia : Clinical Feasibility and Preliminary Results," Frontiers in Computer Science, vol. 3, 2021.

[8]

G. Valle-Perez et al., "Transflower : probabilistic autoregressive dance generation with multimodal attention," ACM Transactions on Graphics, vol. 40, no. 6, 2021.

[9]

G. E. Henter, S. Alexanderson and J. Beskow, "MoGlow : Probabilistic and controllable motion synthesis using normalising flows," ACM Transactions on Graphics, vol. 39, no. 6, pp. 1-14, 2020.

[10]

S. Alexanderson et al., "Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows," Computer graphics forum (Print), vol. 39, no. 2, pp. 487-496, 2020.

[11]

G. E. Henter and W. B. Kleijn, "Minimum entropy rate simplification of stochastic processes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2487-2500, 2016.

[12]

P. N. Petkov, G. E. Henter and W. B. Kleijn, "Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 1035-1045, 2013.

[13]

G. E. Henter and W. B. Kleijn, "Picking up the pieces : Causal states in noisy data, and how to recover them," Pattern Recognition Letters, vol. 34, no. 5, pp. 587-594, 2013.

Conference papers

[14]

S. H. Bokkahalli Satish, G. E. Henter and É. Székely, "When Voice Matters : Evidence of Gender Disparity in Positional Bias of SpeechLLMs," in Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings, 2026, pp. 25-38.

[15]

S. H. Bokkahalli Satish, G. E. Henter and É. Székely, "Hear Me Out : Interactive evaluation and bias discovery platform for speech-to-speech conversational AI," in Interspeech 2025, 2025, pp. 2151-2152.

[16]

V. S. Lodagala et al., "SawtArabi : A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching," in Interspeech 2025, 2025, pp. 4793-4797.

[17]

P. Tuttösí et al., "Take a Look, it's in a Book, a Reading Robot," in HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, 2025, pp. 1803-1805.

[18]

U. Wennberg and G. E. Henter, "Exploring Internal Numeracy in Language Models: A Case Study on ALBERT," in MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings, 2024, pp. 35-40.

[19]

S. Mehta et al., "Fake it to make it : Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964.

[20]

S. Mehta et al., "Fake it to make it : Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis," in Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, 2024, pp. 1952-1964.

[21]

Y. Yoon et al., "GENEA Workshop 2024 : The 5th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents," in PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2024, 2024, pp. 694-695.

[22]

U. Wennberg and G. E. Henter, "Learned Transformer Position Embeddings Have a Low-Dimensional Structure," in ACL 2024 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 - Proceedings of the Workshop, 2024, pp. 237-244.

[23]

S. Mehta et al., "MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING," in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, 2024, pp. 11341-11345.

[24]

S. Mehta et al., "Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech," in Interspeech 2024, 2024, pp. 2285-2289.

[25]

S. Mehta et al., "Unified speech and gesture synthesis using flow matching," in 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024, pp. 8220-8224.

[26]

P. Wolfert, G. E. Henter and T. Belpaeme, ""Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion," in ICMI 2023 Companion : Companion Publication of the 25th International Conference on Multimodal Interaction, 2023, pp. 6-10.

[27]

S. Wang et al., "A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS," in ICASSPW 2023 : 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, 2023.

[28]

P. Pérez Zarazaga, G. E. Henter and Z. Malisz, "A processing framework to access large quantities of whispered speech found in ASMR," in ICASSP 2023 : 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

[29]

J. J. Webber et al., "Autovocoder: Fast Waveform Generation from a Learned Speech Representation Using Differentiable Digital Signal Processing," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings, 2023.

[30]

S. Mehta et al., "Diff-TTSG : Denoising probabilistic integrated speech and gesture synthesis," in Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, 2023, pp. 150-156.

[31]

Y. Yoon et al., "GENEA Workshop 2023 : The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents," in ICMI 2023 : Proceedings of the 25th International Conference on Multimodal Interaction, 2023, pp. 822-823.

[32]

S. Mehta et al., "OverFlow : Putting flows on top of neural transducers for better TTS," in Interspeech 2023, 2023, pp. 4279-4283.

[33]

H. Lameris et al., "Prosody-Controllable Spontaneous TTS with Neural HMMs," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.

[34]

P. Pérez Zarazaga et al., "Speaker-independent neural formant synthesis," in Interspeech 2023, 2023, pp. 5556-5560.

[35]

T. Kucherenko et al., "The GENEA Challenge 2023 : A large-scale evaluation of gesture generation models in monadic and dyadic setings," in Proceedings Of The 25Th International Conference On Multimodal Interaction, Icmi 2023, 2023, pp. 792-801.

[36]

P. Wolfert et al., "GENEA Workshop 2022 : The 3rd Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents," in ACM International Conference Proceeding Series, 2022, pp. 799-800.

[37]

T. Kucherenko et al., "Multimodal analysis of the predictability of hand-gesture properties," in AAMAS '22: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, 2022, pp. 770-779.

[38]

S. Mehta et al., "Neural HMMs are all you need (for high-quality attention-free TTS)," in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7457-7461.

[39]

C. Valentini-Botinhao et al., "Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks," in INTERSPEECH 2022, 2022, pp. 471-475.

[40]

J. Fong et al., "Speech Audio Corrector : using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech," in INTERSPEECH 2022, 2022, pp. 1213-1217.

[41]

Y. Yoon et al., "The GENEA Challenge 2022 : A large evaluation of data-driven co-speech gesture generation," in ICMI 2022 : Proceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 736-747.

[42]

G. Beck et al., "Wavebender GAN : An architecture for phonetically meaningful speech manipulation," in 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022.

[43]

T. Kucherenko et al., "A large, crowdsourced evaluation of gesture generation systems on common data : The GENEA Challenge 2020," in Proceedings IUI '21: 26th International Conference on Intelligent User Interfaces, 2021, pp. 11-21.

[44]

M. M. Sorkhei, G. E. Henter and H. Kjellström, "Full-Glow : Fully conditional Glow for more realistic image generation," in Pattern Recognition : 43rd DAGM German Conference, DAGM GCPR 2021, 2021, pp. 697-711.

[45]

T. Kucherenko et al., "GENEA Workshop 2021 : The 2nd Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents," in Proceedings of ICMI '21: International Conference on Multimodal Interaction, 2021, pp. 872-873.

[46]

P. Jonell et al., "HEMVIP: Human Evaluation of Multiple Videos in Parallel," in ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction, 2021, pp. 707-711.

[47]

S. Wang et al., "Integrated Speech and Gesture Synthesis," in ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction, 2021, pp. 177-185.

[48]

T. Kucherenko et al., "Speech2Properties2Gestures : Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech," in IVA '21 : Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, 2021, pp. 145-147.

[49]

U. Wennberg and G. E. Henter, "The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models," in ACL-IJCNLP 2021 : THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, pp. 130-140.

[50]

É. Székely et al., "Breathing and Speech Planning in Spontaneous Speech Synthesis," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7649-7653.

[51]

S. Alexanderson et al., "Generating coherent spontaneous speech and gesture from text," in Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, 2020.

[52]

T. Kucherenko et al., "Gesticulator : A framework for semantically-aware speech-driven gesture generation," in ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction, 2020.

[53]

P. Jonell et al., "Let’s face it : Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings," in IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, 2020.

[54]

K. Håkansson et al., "Robot-assisted detection of subclinical dementia : progress report and preliminary findings," in In 2020 Alzheimer's Association International Conference. ALZ., 2020.

[55]

A. Ghosh et al., "Robust classification using hidden markov models and mixtures of normalizing flows," in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), 2020.

[56]

S. Alexanderson and G. E. Henter, "Robust model training and generalisation with Studentising flows," in Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2020, pp. 25:1-25:9.

[57]

T. Kucherenko et al., "Analyzing Input and Output Representations for Speech-Driven Gesture Generation," in 19th ACM International Conference on Intelligent Virtual Agents, 2019.

[58]

É. Székely, G. E. Henter and J. Gustafson, "Casting to Corpus : Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6925-6929.

[59]

É. Székely et al., "How to train your fillers: uh and um in spontaneous speech synthesis," in The 10th ISCA Speech Synthesis Workshop, 2019.

[60]

É. Székely et al., "Off the cuff : Exploring extemporaneous speech delivery with TTS," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019, pp. 3687-3688.

[61]

T. Kucherenko et al., "On the Importance of Representations for Speech-Driven Gesture Generation : Extended Abstract," in International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada, 2019, pp. 2072-2074.

[62]

P. Wagner et al., "Speech Synthesis Evaluation : State-of-the-Art Assessment and Suggestion for a Novel Research Program," in Proceedings of the 10th Speech Synthesis Workshop (SSW10), 2019.

[63]

É. Székely et al., "Spontaneous conversational speech synthesis from found data," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019, pp. 4435-4439.

[64]

Z. Malisz et al., "The speech synthesis phoneticians need is both realistic and controllable," in Proceedings from FONETIK 2019, 2019.

[65]

O. Watts et al., "Where do the improvements come from in sequence-to-sequence neural TTS?," in Proceedings of the 10th ISCA Speech Synthesis Workshop, 2019, pp. 217-222.

[66]

P. N. Petkov, W. B. Kleijn and G. E. Henter, "Enhancing Subjective Speech Intelligibility Using a Statistical Model of Speech," in 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol 1, 2012, pp. 166-169.

[67]

G. E. Henter, M. R. Frean and W. B. Kleijn, "Gaussian process dynamical models for nonparametric speech representation and synthesis," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 2012, pp. 4505-4508.

[68]

G. E. Henter and W. B. Kleijn, "Intermediate-State HMMs to Capture Continuously-Changing Signal Features," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2011, pp. 1828-1831.

[69]

G. E. Henter and W. B. Kleijn, "Simplified Probability Models for Generative Tasks : a Rate-Distortion Approach," in Proceedings of the European Signal Processing Conference, 2010, pp. 1159-1163.

Non-peer reviewed

Conference papers

[70]

H. Lameris et al., "Spontaneous Neural HMM TTS with Prosodic Feature Modification," in Proceedings of Fonetik 2022, 2022.

Theses

[71]

G. E. Henter, "Probabilistic Sequence Models with Speech and Language Applications," Doctoral thesis Stockholm : KTH Royal Institute of Technology, Trita-EE, 2013:042, 2013.

Other

[72]

S. Wang et al., "A comparative study of self-supervised speech representationsin read and spontaneous TTS," (Manuscript).

[73]

G. E. Henter, S. Alexanderson and J. Beskow, "Moglow : Probabilistic and controllable motion synthesis using normalising flows," (Manuscript).

[74]

G. E. Henter, A. Leijon and W. B. Kleijn, "Kernel Density Estimation-Based Markov Models with Hidden State," (Manuscript).

[75]

G. E. Henter and W. B. Kleijn, "Minimum Entropy Rate Simplification of Stochastic Processes," (Manuscript).

[76]

T. Kucherenko et al., "The GENEA Challenge 2020 : Benchmarking gesture-generation systems on common data," (Manuscript).

Latest sync with DiVA:

2026-01-18 00:41:03 UTC

Studies

Research

Collaboration

About KTH

Library

Publications by Gustav Henter

Peer reviewed

Articles

Conference papers

Non-peer reviewed

Conference papers

Theses

Other

Contact