
Area 2: Human Communication and Behavior

Communicating humans

In this area we develop models of how humans perceive and produce non-verbal communication. These models serve both to deepen our understanding of the mechanisms underlying human communication and behavior, and to design systems in which communication and behavior understanding is applied, e.g. the computerized analysis of cognitive decline.


Research Engineers

  • Gianluca Volkmer
  • Athanasios Charisoudis (2022-2023)
  • Ricky Molén (2022)
  • Carles Balsells Rodas (2019-2021)
  • Chenda Zhang (2020)
  • Moein Sorkhei (2018-2020)

MSc Students

  • Fanxuan Liu
  • Ioannis Athanasiadis (MSc 2022)
  • Frans Nordén (MSc 2021)
  • Olga Mikheeva (MSc 2017)

PhD Students

Post Docs


Current Projects

Generative AI for recognition of human behavior (SeRC 2023-present)


This project, part of the SeRC Data Science MCP, develops generative AI methods to model and recognize human non-verbal behavior from 3D body, face and hand pose, gaze behavior, and RGB video, in the context of language, knowledge representations, and EEG. It operates in interaction with OrchestrAI in Area 1, Detecting behavioral biomarkers, STING and UNCOCO in Area 2, and Generative AI for the creation of artificial spiderweb in Area 4.


Detecting behavioral biomarkers (KTH, KI 2023-present)

In this project, which is part of the SeRC eMPH MCP and a collaboration with the Department of Women’s and Children’s Health at Karolinska Institutet, we develop representation learning methods to detect various kinds of biomarkers, typically connected to underlying motor or cognitive conditions, from non-verbal behavior. The primary current application is the detection of motor conditions in neonates, but we will also work with datasets from other applications, containing 3D body, face and hand pose, gaze behavior, RGB video, IMU data, and other kinds of measurements. The underlying mechanisms are modeled using deep generative approaches such as VAEs and diffusion models.
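As a concrete illustration of the VAE component, the following is a minimal numpy sketch of the reparameterised encode/decode step and the two ELBO terms a VAE is trained to optimise. The linear maps, dimensions, and feature semantics are placeholders for illustration, not the project's actual networks or data.

```python
import numpy as np

# Minimal VAE sketch: encode a behavioural feature vector to a stochastic
# latent code, decode it back, and compute the two ELBO loss terms.
# Linear maps stand in for the encoder/decoder neural networks.

rng = np.random.default_rng(0)
D, Z = 8, 2                           # feature and latent dimensions (illustrative)
W_mu  = rng.normal(0, 0.1, (D, Z))    # encoder mean weights
W_ls  = rng.normal(0, 0.1, (D, Z))    # encoder log-variance weights
W_dec = rng.normal(0, 0.1, (Z, D))    # decoder weights

x = rng.normal(size=(5, D))           # e.g. per-frame pose features (toy data)

# Encoder: q(z|x) = N(mu, diag(exp(log_var)))
mu, log_var = x @ W_mu, x @ W_ls
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps  # reparameterisation trick

# Decoder and the two ELBO terms that training would minimise:
x_hat = z @ W_dec
recon = np.mean((x - x_hat) ** 2)                                 # reconstruction error
kl = 0.5 * np.mean(np.exp(log_var) + mu**2 - 1.0 - log_var)       # KL to the N(0, I) prior
elbo_loss = recon + kl
```

In a trained model, the latent code z (or the mean mu) would serve as the learned representation from which a biomarker classifier could be built.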


STING: Synthesis and analysis with Transducers and Invertible Neural Generators (WASP 2022-present)


Human communication is multimodal in nature, occurring through combinations of speech, language, gesture, facial expression, and similar signals. To enable natural interactions with human beings, artificial agents must be capable of both analysing and producing these rich and interdependent signals, and of connecting them to their semantic implications. Unfortunately, even the strongest machine learning methods currently fall short of this goal: automated semantic understanding of human behaviour remains superficial, and generated agent behaviours are empty gestures lacking the ability to convey meaning and communicative intent.

The STING NEST, part of the WARA Media and Language, intends to change this state of affairs by uniting synthesis and analysis with transducers and invertible neural models. This involves connecting concrete, continuous-valued sensory data such as images, sound, and motion with high-level, predominantly discrete representations of meaning, which has the potential to endow synthesis output with human-understandable high-level explanations, while simultaneously improving the ability to attach probabilities to semantic representations. The bidirectionality also allows us to create efficient mechanisms for explainability, and to inspect and enforce fairness in the models.
Recent advances in generative models suggest that our ambitious research agenda is likely to be met with success. Normalising flows and variational autoencoders permit both extracting disentangled representations of observations and (re-)generating observations from these abstract representations, all within a single model. Their recent extensions to graph-structured data are of particular interest, because graphs are commonly used semantic representations. This opens the door not only to generating structured information, but also to capturing the composition of the generation itself (which is a graph in its own right) by exploiting and transferring techniques from finite-state transducers and graph grammars.
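To illustrate the invertibility that lets a single model serve both analysis and synthesis, here is a minimal numpy sketch of one affine coupling layer, the building block of normalising flows in the RealNVP family. All names, sizes, and the tiny linear "networks" are illustrative assumptions, not STING code.

```python
import numpy as np

class AffineCoupling:
    """One invertible affine-coupling layer: the first half of the input
    parameterises an affine transform of the second half. Forward and
    inverse are exact and share all parameters, so the same model encodes
    observations to latent codes (analysis) and decodes them back
    (synthesis)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        half = dim // 2
        # Tiny linear maps producing the log-scale and shift.
        self.Ws = rng.normal(0, 0.1, (half, half))
        self.Wt = rng.normal(0, 0.1, (half, half))

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        log_s, t = x1 @ self.Ws, x1 @ self.Wt
        z2 = x2 * np.exp(log_s) + t          # invertible by construction
        return np.concatenate([x1, z2], axis=-1)

    def inverse(self, z):
        z1, z2 = np.split(z, 2, axis=-1)
        log_s, t = z1 @ self.Ws, z1 @ self.Wt
        x2 = (z2 - t) * np.exp(-log_s)       # exact inverse of forward
        return np.concatenate([z1, x2], axis=-1)

layer = AffineCoupling(dim=4)
x = np.random.default_rng(1).normal(size=(3, 4))  # e.g. toy motion features
z = layer.forward(x)        # analysis: observation -> latent code
x_rec = layer.inverse(z)    # synthesis: latent code -> observation
assert np.allclose(x, x_rec)
```

Stacking many such layers (with the halves swapped between layers) yields an expressive yet exactly invertible map, which is what allows the same parameters to assign probabilities to observations and to generate new ones.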


Project home page

UNCOCO: UNCOnscious COmmunication (WASP, KTH, KI 2020-present)


This project, which is part of the WARA Media and Language and a collaboration with the Perceptual Neuroscience group at KI, entails two contributions.

Firstly, we develop an embodied, integrated 3D representation of head pose, gaze, and facial micro-expressions that can be extracted from a regular 60 Hz video camera and a desk-mounted gaze sensor. This representation serves as a preprocessing step for the second contribution: a deep generative model for inferring the latent emotional state of the human from their non-verbal communicative behavior. The model is employed in three different contexts: 1) estimating user affect for a digital avatar, 2) analyzing human non-verbal behavior connected to sensory stimuli, e.g., quantifying the approach/avoidance motor response to smell, and 3) estimating frustration in a driving scenario.


HiSS: Humanizing the Sustainable Smart city (KTH Digital Futures, 2019-present)

Description in Area 4: Embodied Representation Learning

Past Projects

EACare: Embodied Agent to support elderly mental wellbeing (SSF, 2016-2021)

The main goal of the multidisciplinary project EACare is to develop an embodied agent – a robot head with communicative skills – capable of interacting with elderly people in particular, at a clinic or in their homes. The agent analyzes their mental and psychological status via powerful audiovisual sensing and assesses their mental abilities, in order to identify subjects at high risk of, or possibly in the first stages of, cognitive decline, with a special focus on Alzheimer’s disease. The interaction follows the procedures developed for memory evaluation sessions, the key part of the diagnostic process for detecting cognitive decline.
This new diagnostic system will be one of the means by which medical doctors evaluate people for cognitive decline, in parallel to existing methods such as memory evaluation sessions with a (human) clinician, MRI scans, blood tests, etc. Different parts of the framework can also be used for other purposes, such as to develop tools for preventive dementia training and for decision support during clinical memory evaluation sessions.


Project home page

Data-driven modelling of interaction skills for social robots (KTH, 2016-2018)

Description in Area 4: Embodied Representation Learning

HumanAct: Visual and multi-modal learning of Human Activity and interaction with the surrounding scene (VR, EIT ICT Labs, 2010-2013)

Description in Area 4: Embodied Representation Learning