Vision-and-Language Navigation (VLN) involves guiding an agent through indoor environments using human-provided textual instructions. Coarse-grained VLN, with short and high-level instructions, has gained popularity as it closely mirrors real-world scenarios. However, a significant challenge is that these instructions are often too concise for agents to comprehend and act upon. Previous studies have explored allowing agents to seek assistance during navigation, but such approaches typically offer rigid support drawn from pre-existing datasets or simulators. The advent of Large Language Models (LLMs) presents a novel avenue for aiding VLN agents. This paper introduces VLN-Copilot, a framework enabling agents to actively seek assistance when encountering confusion, with the LLM serving as a copilot to facilitate navigation. Our approach introduces a confusion score that quantifies the uncertainty in an agent's action decisions, while the LLM provides real-time, detailed guidance for navigation. Experimental results on two coarse-grained VLN datasets demonstrate the efficacy of our method.
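To make the confusion-triggered querying concrete, the following is a minimal sketch in Python. It assumes the confusion score is computed as normalized entropy over the agent's predicted action distribution and compared against a threshold; the exact formulation, the threshold value, and the function names here are illustrative assumptions, not the paper's implementation.

```python
import math

def confusion_score(action_probs):
    """Normalized entropy of the agent's action distribution in [0, 1].
    Higher values indicate greater uncertainty about which action to take.
    (Assumed formulation for illustration; not necessarily the paper's.)
    """
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    max_entropy = math.log(len(action_probs))  # entropy of a uniform distribution
    return entropy / max_entropy if max_entropy > 0 else 0.0

def should_ask_copilot(action_probs, threshold=0.7):
    """Query the LLM copilot only when the agent is sufficiently confused.
    The threshold value is a placeholder, not taken from the paper.
    """
    return confusion_score(action_probs) >= threshold

# Example: a nearly uniform distribution over four candidate viewpoints
# yields a high confusion score, so the agent would consult the copilot.
print(should_ask_copilot([0.3, 0.25, 0.25, 0.2]))  # True
```

In this sketch, the score gates how often the LLM is consulted: confident decisions proceed without extra calls, while ambiguous ones trigger a request for detailed guidance.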