Taga-VLM | Xiaoyan Li

We propose a novel Vision-Language Navigation (VLN) agent powered by Large Vision-Language Models (LVLMs). By leveraging the strong reasoning and multimodal alignment capabilities of LVLMs, the agent autonomously navigates previously unseen indoor environments by interpreting natural language instructions and analyzing panoramic visual observations. A key advantage of this framework is its versatility: it demonstrates robust navigation performance and generalization in both discrete (graph-based) and continuous (Euclidean) environments, bridging the gap between high-level semantic understanding and low-level action execution. More details will be coming soon.

Visualization of the proposed VLN approach.

Generalizable Vision-Language Navigation across Diverse Scenarios.
Vision-Language Navigation with Look-Ahead Exploration.
Speaker-Follower Models for Vision-Language Navigation.

Related Projects