Visual Grounding in Dialogue

In situated dialogue, we refer to things in our shared environment. How can robots, when interacting with humans, understand and generate such references?

This project investigates how references to objects in a shared visual scene are made during dialogue, with a focus on how dialogue context shapes these references. We explore how linguistic devices such as anaphora and proforms (e.g., "it," "the red one") are used to maintain coherence when referring to objects, and how these referring expressions evolve over the course of a conversation.

One key aspect of this process is the formation of conceptual pacts: mutual agreements between dialogue participants on how to refer to specific objects or concepts. These pacts evolve as the conversation progresses, leading to more efficient and context-specific references over time. For example, once a speaker introduces an object as "the tall vase," both parties might later refer to it simply as "the vase," relying on the established shared understanding.
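To make this concrete, the sketch below shows one minimal way a dialogue system could keep track of conceptual pacts: a table from each referent to the expressions used for it, where the most recently accepted expression serves as the current pact. This is an illustrative Python sketch, not code from the project; the PactTracker class and all identifiers are hypothetical.

    # Hypothetical sketch: tracking conceptual pacts in a dialogue state.
    # Each referent id maps to the history of expressions used for it; the
    # most recent mutually accepted expression is treated as the "pact".
    from collections import defaultdict

    class PactTracker:
        def __init__(self):
            # referent id -> referring expressions, oldest first
            self.expressions = defaultdict(list)

        def record(self, referent_id, expression):
            """Log a referring expression once both parties accept it."""
            self.expressions[referent_id].append(expression)

        def current_pact(self, referent_id):
            """Return the most recently established expression, if any."""
            history = self.expressions[referent_id]
            return history[-1] if history else None

    tracker = PactTracker()
    tracker.record("obj_3", "the tall vase")  # first mention: full description
    tracker.record("obj_3", "the vase")       # later: shortened under the pact
    print(tracker.current_pact("obj_3"))      # -> "the vase"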

We train multimodal large language models to both understand and generate references that take conceptual pacts and other contextual cues into account. Ultimately, this research will contribute to the development of systems capable of seamless, cooperative interaction in visually grounded settings, such as human-robot dialogue.
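As a rough illustration of the comprehension side, the sketch below scores candidate object crops against a referring expression in a CLIP-style joint image-text embedding space, using the Hugging Face transformers CLIP model. This is a simplified assumption about how grounding could be done, not the project's actual pipeline; the resolve_reference helper and the choice of checkpoint are placeholders.

    # Hypothetical sketch: resolving a referring expression against candidate
    # object crops via CLIP image-text similarity.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def resolve_reference(expression: str, candidate_crops: list[Image.Image]) -> int:
        """Return the index of the crop that best matches the expression."""
        inputs = processor(text=[expression], images=candidate_crops,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # logits_per_text has shape (1, num_images): one score per candidate
        return int(outputs.logits_per_text.argmax(dim=-1).item())

    # Usage: crops would come from an object detector over the shared scene;
    # before matching, a dialogue-aware model could rewrite anaphora in the
    # expression ("it" -> "the tall vase") using the conversation history.
    # best = resolve_reference("the tall vase", crops)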

Researchers

Gabriel Skantze, Professor
Bram Willemsen

Publications

[1]
B. Willemsen and G. Skantze, "Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding," in Proceedings of the 17th International Natural Language Generation Conference (INLG), 2024, pp. 453-469.
[2]
B. Willemsen, L. Qian and G. Skantze, "Resolving References in Visually-Grounded Dialogue via Text Generation," in Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2023, pp. 457-469.
[3]
B. Willemsen, D. Kalpakchi and G. Skantze, "Collecting Visually-Grounded Dialogue with A Game Of Sorts," in Proceedings of the 13th Conference on Language Resources and Evaluation, 2022, pp. 2257-2268.
[4]
G. Skantze and B. Willemsen, "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings," Journal of Artificial Intelligence Research, vol. 74, pp. 1201-1223, 2022.
[5]
T. Shore and G. Skantze, "Using Lexical Alignment and Referring Ability to Address Data Sparsity in Situated Dialog Reference Resolution," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 2288-2297.
[6]
T. Shore, T. Androulakaki and G. Skantze, "KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue," in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), 2018, pp. 768-775.
[7]
D. Kontogiorgos et al., "Multimodal Reference Resolution in Collaborative Assembly Tasks," 2018.

Funding

  • Robot learning of symbol grounding in multiple contexts through dialog (WASP, 2018-2023)
  • COIN: Co-adaptive human-robot interactive systems (SSF, 2016-2020)