Poster

LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu · Long Chen · Jan Hünermann · Alice Karnsund · Benoit Hanotte · Prajwal Chidananda · Saurabh Nair · Vijay Badrinarayanan · Alex Kendall · Jamie Shotton · Elahe Arani · Oleg Sinavski

Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

We introduce LingoQA, a novel dataset and benchmark for video question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance falls below human capabilities, with GPT-4V responding truthfully to 56.67% of the questions compared to 93.4% for humans. For evaluation, in addition to conducting a human study, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
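For readers unfamiliar with the Spearman coefficient mentioned in the abstract, the following is a minimal sketch of how a rank correlation between an automatic metric and human ratings can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the example scores are hypothetical, and the abstract does not say at what granularity (per answer, per model) the reported 0.95 was measured.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical scores, invented purely for illustration.
metric_scores = [0.2, 0.5, 0.9, 0.4]  # e.g. outputs of an automatic judge
human_scores = [0.1, 0.6, 0.8, 0.3]   # e.g. ratings from a human study
rho = spearman(metric_scores, human_scores)
```

Because only the ordering of the scores matters, a metric can reach a high Spearman coefficient even if its raw values sit on a different scale from the human ratings.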
