

Poster

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Chuanhao Li · Zhen Li · Chenchen Jing · Yuwei Wu · Mingliang Zhai · Yunde Jia

Strong blind review: This paper was not made available on public preprint services during the review process.
[ Project Page ]
Wed 2 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Compositional generalization has recently received much attention in vision-and-language research and visual reasoning. Substitutivity, the capability to generalize to novel compositions built from synonymous primitives such as words and visual entities, is an essential factor in evaluating compositional generalization ability but remains largely unexplored. In this paper, we explore the compositional substitutivity of visual reasoning in the context of visual question answering (VQA). We propose a training framework that enables VQA models to maintain compositional substitutivity. The basic idea is to learn invariant representations for synonymous primitives via support sets. Specifically, for each question-image pair, we construct a support question set and a support image set, each containing questions/images that share synonymous primitives with the original question/image. By enforcing the VQA model to reconstruct the original question/image from these sets, the model learns to identify which primitives are synonymous. To quantitatively evaluate the substitutivity of VQA models, we introduce two datasets, GQA-SPS and VQA-SPS v2, constructed by performing three types of substitutions with synonymous primitives, including words, visual entities, and referents. Experimental results demonstrate the effectiveness of our framework. We release GQA-SPS and VQA-SPS v2 at https://github.com/NeverMoreLCH/CG-SPS.
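The abstract describes the support-set mechanism only at a high level. The sketch below illustrates one way such an objective could be wired up in PyTorch: a standard VQA loss plus two reconstruction terms computed from the support question set and the support image set. The model interface (encode_question, encode_image, answer), the attention-based reconstruction, and the loss weights w_q / w_v are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of a support-set training objective, loosely following
# the abstract. The model interface and loss weights are assumptions.
import torch
import torch.nn.functional as F


def substitutivity_step(model, question, image, answer,
                        support_questions, support_images,
                        w_q=1.0, w_v=1.0):
    """One training step: VQA loss plus question/image reconstruction losses.

    support_questions / support_images hold samples that share synonymous
    primitives (words / visual entities) with the original question-image pair.
    """
    # Standard VQA objective on the original question-image pair.
    q_feat = model.encode_question(question)   # (B, D)
    v_feat = model.encode_image(image)         # (B, D)
    logits = model.answer(q_feat, v_feat)      # (B, num_answers)
    vqa_loss = F.cross_entropy(logits, answer)

    # Reconstruct the original question representation from the support
    # question set; attending over synonymous questions pushes the encoder
    # toward representations invariant to word-level substitutions.
    sq = torch.stack([model.encode_question(q) for q in support_questions], dim=1)  # (B, K, D)
    attn_q = torch.softmax(torch.einsum('bd,bkd->bk', q_feat, sq), dim=-1)
    q_recon = torch.einsum('bk,bkd->bd', attn_q, sq)
    q_recon_loss = F.mse_loss(q_recon, q_feat.detach())

    # Same idea on the visual side with the support image set.
    sv = torch.stack([model.encode_image(v) for v in support_images], dim=1)        # (B, K, D)
    attn_v = torch.softmax(torch.einsum('bd,bkd->bk', v_feat, sv), dim=-1)
    v_recon = torch.einsum('bk,bkd->bd', attn_v, sv)
    v_recon_loss = F.mse_loss(v_recon, v_feat.detach())

    return vqa_loss + w_q * q_recon_loss + w_v * v_recon_loss
```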
