

Poster

Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality

Kyu Ri Park · Hong Joo Lee · Jung Uk Kim

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Current research on Audio-Visual Question Answering (AVQA) typically requires complete visual and audio inputs to understand a scene and answer the (textual) question. However, in real-world scenarios, problems such as device malfunctions or transmission errors are common, so audio or visual information is occasionally missing. Such scenarios significantly degrade the performance of AVQA networks. To address this challenge, we propose a novel AVQA framework that effectively handles missing modalities and answers questions even when audio or visual information is absent. Inspired by human sensory association, our framework introduces a Relation-aware Missing Modal (RMM) generator that creates pseudo features to recover missing-modality information by correlating the cues of the available modality. We then introduce an Audio-Visual Relation-aware (AVR) diffusion model that enhances the overall audio-visual feature representations (both the missing-modality pseudo feature and the original modality feature) by considering the associations between them. As a result, our approach outperforms state-of-the-art AVQA methods, even when the audio or visual modality is missing. We believe that our method enables more realistic studies of AVQA networks and has the potential for application in various multi-modal scenarios. The code will be made publicly available.
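
The abstract describes generating a pseudo feature for the missing modality from the cues of the available modality and the question. Below is a minimal, hypothetical sketch of that general idea using cross-attention; it is not the authors' implementation, and the module names, dimensions, and single-query design are assumptions for illustration only (the paper's RMM generator and AVR diffusion refinement are not reproduced here).

# Hypothetical sketch: generate a pseudo feature for a missing modality
# (e.g., audio) by cross-attending over the available modality and the question.
import torch
import torch.nn as nn

class PseudoModalityGenerator(nn.Module):
    """Produces a stand-in feature for a missing modality from available cues."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Learnable query that stands in for the missing modality.
        self.missing_query = nn.Parameter(torch.randn(1, 1, dim))
        # Cross-attention: the missing-modality query attends to available cues.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, available_feats: torch.Tensor, question_feats: torch.Tensor) -> torch.Tensor:
        # available_feats: (B, Tv, dim) visual tokens; question_feats: (B, Tq, dim) text tokens
        cues = torch.cat([available_feats, question_feats], dim=1)    # (B, Tv+Tq, dim)
        query = self.missing_query.expand(available_feats.size(0), -1, -1)
        pseudo, _ = self.cross_attn(query, cues, cues)                # (B, 1, dim)
        return self.proj(pseudo)                                      # pseudo feature for the missing modality

if __name__ == "__main__":
    gen = PseudoModalityGenerator()
    visual = torch.randn(2, 60, 512)      # available visual tokens
    question = torch.randn(2, 20, 512)    # question (text) tokens
    pseudo_audio = gen(visual, question)  # stand-in for the missing audio feature
    print(pseudo_audio.shape)             # torch.Size([2, 1, 512])

In the paper's framework, such a pseudo feature would then be refined jointly with the original modality feature (the role of the AVR diffusion model) before answering the question.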
