*This was a HYBRID Event with in-person attendance in Levine 512 and Virtual attendance…
ABSTRACT
We consider the problem of Vision-and-Language Navigation (VLN) in previously unseen, realistic indoor environments. Arguably, the biggest challenge in VLN is grounding the natural language in the visual input. The majority of current methods for VLN are trained end-to-end using either unstructured memory such as an LSTM or cross-modal attention over the egocentric RGB-D observations of the agent. We are motivated by studies of navigation in biological systems suggesting that humans build cognitive maps during such tasks. In contrast to other works, we argue that an egocentric map offers a more natural representation for this task. In this talk, we will explore a novel navigation system for the VLN task in continuous environments that learns a language-informed representation for both map and trajectory prediction. This approach semantically grounds the language through an egocentric map-prediction task that learns to hallucinate information outside the agent's field of view. This is followed by spatial grounding of the instruction via path prediction on the egocentric map. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
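To make the two-stage idea in the abstract concrete, below is a minimal sketch (not the speaker's actual code) of a language-conditioned pipeline: a first module predicts an egocentric semantic map from observation features and the instruction, and a second module predicts a path heatmap on that map. All module names, tensor sizes, and the overall wiring are illustrative assumptions.

```python
# Hypothetical sketch of language-conditioned egocentric map + path prediction.
# Names, dimensions, and architecture choices are assumptions for illustration only.

import torch
import torch.nn as nn


class LanguageConditionedMapPredictor(nn.Module):
    """Predicts an egocentric semantic map (including regions outside the
    field of view) from encoded observations, conditioned on the instruction."""

    def __init__(self, obs_dim=256, num_classes=27, map_size=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(obs_dim, num_heads=4, batch_first=True)
        self.decode = nn.Sequential(
            nn.Linear(obs_dim, map_size * map_size),        # project fused features to a map grid
            nn.Unflatten(-1, (1, map_size, map_size)),
            nn.Conv2d(1, num_classes, kernel_size=3, padding=1),  # per-cell semantic class logits
        )

    def forward(self, obs_tokens, text_tokens):
        # Cross-modal attention: observation tokens attend to instruction tokens.
        fused, _ = self.attn(obs_tokens, text_tokens, text_tokens)
        pooled = fused.mean(dim=1)                          # (B, obs_dim)
        return self.decode(pooled)                          # (B, C, H, W) map logits


class PathPredictor(nn.Module):
    """Predicts a path heatmap over the egocentric map, conditioned on the instruction."""

    def __init__(self, num_classes=27, text_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(num_classes * 2, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),                # per-cell path logit
        )

    def forward(self, map_logits, text_tokens):
        B, C, H, W = map_logits.shape
        text = self.text_proj(text_tokens.mean(dim=1))      # (B, C) instruction summary
        text_plane = text[:, :, None, None].expand(B, C, H, W)
        return self.head(torch.cat([map_logits, text_plane], dim=1))  # (B, 1, H, W)


# Example forward pass with random tensors standing in for real encodings.
obs = torch.randn(2, 196, 256)    # e.g. patch features from egocentric RGB-D
text = torch.randn(2, 40, 256)    # e.g. token embeddings of the instruction
map_logits = LanguageConditionedMapPredictor()(obs, text)
path_logits = PathPredictor()(map_logits, text)
print(map_logits.shape, path_logits.shape)  # (2, 27, 64, 64) and (2, 1, 64, 64)
```

The design choice reflected here is the one the abstract describes: grounding happens first on the map (semantic hallucination beyond the field of view) and then spatially (path prediction on that map), rather than acting directly from raw egocentric frames.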