

Poster

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Seokha Moon · Hyun Woo · Hongbeen Park · Haeji Jung · Reza Mahjourian · Hyung-gun Chi · Hyerin Lim · Sangpil Kim · Jinkyu Kim

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Predicting the future trajectories of other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system, together with an HD map, as inputs to a model that predicts agent trajectories. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to exploit visual cues such as human gaze and gestures, road conditions, and vehicle turn signals, which are typically hidden from the model in prior trajectory prediction methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision to guide the model on what to learn from the input data. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Despite using these extra inputs, our method achieves a latency of 53 ms, significantly lower than that of previous single-agent prediction methods with similar performance. Lastly, we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing a VLM on trajectory prediction.
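
To make the described setup concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a trajectory predictor that fuses agent-track, HD-map, and camera features and adds an auxiliary loss aligning the fused agent embedding with a text embedding, such as one produced offline by a VLM/LLM pipeline. All module choices, feature sizes, and the loss weighting are illustrative assumptions.

```python
# Hypothetical sketch: multimodal fusion for trajectory prediction with a
# text-alignment auxiliary objective. Not the VisionTrap implementation;
# every dimension and module here is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionTrajectoryPredictor(nn.Module):
    def __init__(self, d_model: int = 128, horizon: int = 12, text_dim: int = 512):
        super().__init__()
        # Past agent track: (x, y, heading, speed) per time step -> GRU encoder.
        self.track_enc = nn.GRU(input_size=4, hidden_size=d_model, batch_first=True)
        # HD-map polyline points (x, y) -> simple PointNet-style encoder.
        self.map_enc = nn.Sequential(
            nn.Linear(2, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Per-agent camera-crop features (assumed precomputed by a CNN backbone).
        self.img_proj = nn.Linear(2048, d_model)
        # Fuse the three modalities and decode a future (x, y) trajectory.
        self.fuse = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU())
        self.decoder = nn.Linear(d_model, horizon * 2)
        # Project fused features into the text-embedding space for the
        # description-alignment auxiliary objective.
        self.text_head = nn.Linear(d_model, text_dim)
        self.horizon = horizon

    def forward(self, track, map_pts, img_feat):
        _, h = self.track_enc(track)                      # (1, B, d)
        track_f = h.squeeze(0)                            # (B, d)
        map_f = self.map_enc(map_pts).max(dim=1).values   # (B, d), pool over points
        img_f = self.img_proj(img_feat)                   # (B, d)
        fused = self.fuse(torch.cat([track_f, map_f, img_f], dim=-1))
        traj = self.decoder(fused).view(-1, self.horizon, 2)
        return traj, self.text_head(fused)


def training_loss(pred_traj, gt_traj, pred_text, text_emb, alpha: float = 0.1):
    """L2 trajectory regression plus cosine alignment to the text embedding."""
    reg = F.mse_loss(pred_traj, gt_traj)
    align = 1.0 - F.cosine_similarity(pred_text, text_emb, dim=-1).mean()
    return reg + alpha * align


if __name__ == "__main__":
    B = 4
    model = FusionTrajectoryPredictor()
    traj, text_pred = model(
        torch.randn(B, 20, 4),    # 20 past track steps per agent
        torch.randn(B, 64, 2),    # 64 map polyline points per agent
        torch.randn(B, 2048),     # pooled camera-crop features
    )
    loss = training_loss(traj, torch.randn(B, 12, 2), text_pred, torch.randn(B, 512))
    print(traj.shape, loss.item())
```

At inference time only the trajectory head is needed, so the text-alignment branch adds no runtime cost, which is one way a text-supervised model could keep latency low; whether this matches the paper's actual design is not confirmed by the abstract.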
