

Poster

Learning Representations from Foundation Models for Domain Generalized Stereo Matching

Yongjian Zhang · Longguang Wang · Kunhong Li · WANG Yun · Yulan Guo

# 324
Strong Double Blind: This paper was not made available on public preprint services during the review process.
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate this cross-domain generalization problem. The main challenge of incorporating a foundation model into a stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates a Vision Transformer (ViT) based foundation model into the stereo matching pipeline. Using this framework, we transfer the all-purpose features into matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance.
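The abstract does not spell out how the C4 space is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "cosine-constrained" means each pixel's feature vector is scaled to unit length before a standard concatenation cost volume is formed. The function names `l2_normalize` and `concat_cost_volume`, the `(C, H, W)` layout, and the `max_disp` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(feat, eps=1e-8):
    """Scale each pixel's channel vector to (approximately) unit length.

    feat has shape (C, H, W); the norm is taken over the channel axis.
    """
    norm = np.linalg.norm(feat, axis=0, keepdims=True)
    return feat / (norm + eps)


def concat_cost_volume(left, right, max_disp):
    """Build a concatenation cost volume from two (C, H, W) feature maps.

    Assumption for illustration: features are unit-normalized first, so the
    dot product between the two halves of the volume is a cosine similarity.

    Returns an array of shape (2C, max_disp, H, W). At disparity d, left
    pixel (y, x) is paired with right pixel (y, x - d). Columns with x < d
    have no valid match and are left as zeros.
    """
    left = l2_normalize(left)
    right = l2_normalize(right)
    c, h, w = left.shape
    volume = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        volume[:c, d, :, d:] = left[:, :, d:]
        volume[c:, d, :, d:] = right[:, :, : w - d]
    return volume
```

Under this assumption, the channel-wise dot product between the left half and the right half of the volume is bounded in [-1, 1], which keeps the matching signal on a comparable scale regardless of the magnitude of the backbone features.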
