Visual Grounding in Dialogue
In situated dialogue, we talk about things in our shared environment. How can robots, when interacting with humans, understand and generate such references?
This project investigates how references to objects in a shared visual scene are made during dialogue, with a focus on how dialogue context shapes these references. We explore how linguistic devices such as anaphora and proforms (e.g., "it," "the red one") are used to maintain coherence when referring to objects, and how these referring expressions evolve over the course of a conversation.

One key aspect of this process is the formation of conceptual pacts: mutual agreements between dialogue participants on how to refer to specific objects or concepts. These pacts evolve as the conversation progresses, leading to more efficient and context-specific references over time. For example, once a speaker introduces an object as "the tall vase," both parties might later refer to it simply as "the vase," relying on the established shared understanding.
Building on multimodal large language models, we train models that both understand and generate references, taking conceptual pacts and contextual cues into account. Ultimately, this research will contribute to the development of systems capable of seamless, cooperative interaction in visually grounded settings, such as human-robot dialogue.
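As a purely illustrative sketch (not the project's implementation), the snippet below shows one way a shared scene and the dialogue history could be packaged into a prompt for a multimodal model, so that earlier mentions of an object are available when generating a new referring expression. The class and function names (`SceneObject`, `Turn`, `build_reference_prompt`) are hypothetical.

```python
"""Illustrative sketch: representing a shared visual scene and dialogue
history, and building a prompt that asks a (hypothetical) multimodal LLM
to produce a context-sensitive referring expression."""

from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    object_id: str
    description: str  # e.g. "tall blue vase on the left shelf"


@dataclass
class Turn:
    speaker: str
    utterance: str
    referent_ids: list[str] = field(default_factory=list)  # objects mentioned


def build_reference_prompt(scene: list[SceneObject], history: list[Turn],
                           target_id: str) -> str:
    """Assemble a text prompt asking for a referring expression for
    `target_id`, given the dialogue so far. Earlier mentions of the same
    object let the model shorten the expression (e.g. "the tall vase" ->
    "the vase"), mirroring the conceptual-pact behaviour described above."""
    lines = ["Objects in the shared scene:"]
    for obj in scene:
        lines.append(f"- {obj.object_id}: {obj.description}")
    lines.append("\nDialogue so far:")
    for turn in history:
        lines.append(f"{turn.speaker}: {turn.utterance}")
    lines.append(
        f"\nProduce a natural referring expression for object {target_id}, "
        "reusing any wording the participants have already agreed on."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    scene = [SceneObject("obj1", "tall blue vase"),
             SceneObject("obj2", "small red bowl")]
    history = [Turn("A", "Could you pass me the tall vase?", ["obj1"]),
               Turn("B", "Sure, here it is.", ["obj1"])]
    print(build_reference_prompt(scene, history, "obj1"))
```

In a full system the prompt would be paired with the scene image and consumed by a multimodal model; the sketch only illustrates how conceptual pacts enter through the dialogue history.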
Researchers
Publications
Funding
- Robot learning of symbol grounding in multiple contexts through dialog (WASP, 2018-2023)
- COIN: Co-adaptive human-robot interactive systems (SSF, 2016-2020)