ECCV Poster Hierarchical Separable Video Transformer for Snapshot Compressive Imaging


Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Ping Wang · Yulun Zhang · Lishun Wang · Xin Yuan

Strong blind review: This paper was not made available on public preprint services during the review process



2024 Poster

Abstract:

Recently, deep learning models have achieved impressive success in solving the inverse problem of Snapshot Compressive Imaging (SCI) for video, i.e., reconstructing multiple high-fidelity frames from a single-shot observation. However, existing works lack insight into the mixed degradation of spatial masking and temporal aliasing, and empirically follow the designs of successful plain video restoration (e.g., denoising, deblurring) models, which limits overall performance. In this work, we tailor a network architecture and a Hierarchical Separable Video Transformer (HiSViT) as its building block, composed of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and a Gated Self-Modulated Feed-Forward Network (GSM-FFN). CSS-MSA decomposes spatio-temporal similarity calculations into spatial and temporal dimensions but attends to all spatio-temporal tokens at a controllable scale within a single attention layer. GSM-FFN is designed to bring locality to CSS-MSA via a gated mechanism and space-time separable convolutions. HiSViT is built from multiple groups of CSS-MSA plus GSM-FFN, each of which focuses on a different scale, enabling multi-scale interaction and long-range modeling. Extensive experiments demonstrate that our model achieves state-of-the-art performance.
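
To make the "mixed degradation of spatial masking and temporal aliasing" concrete, the sketch below simulates the commonly used video SCI forward model, in which each frame is multiplied element-wise by its own binary mask and the masked frames are summed into one 2-D snapshot. The abstract does not spell out this model, so treat it as an assumption: the function names `sci_forward` and `random_binary_masks`, the toy sizes, and the omission of sensor noise are all illustrative choices, not the authors' code.

```python
import random


def sci_forward(frames, masks):
    """Collapse T masked frames into one 2-D snapshot: Y = sum_t M_t * X_t.

    frames, masks: nested lists of shape [T][H][W].
    Assumption: noise-free measurement, one mask per frame.
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    snapshot = [[0.0] * width for _ in range(height)]
    for t in range(num_frames):
        for i in range(height):
            for j in range(width):
                # Spatial masking: the mask keeps (1) or blocks (0) this pixel.
                # Temporal aliasing: every frame adds into the same output pixel.
                snapshot[i][j] += masks[t][i][j] * frames[t][i][j]
    return snapshot


def random_binary_masks(num_frames, height, width, seed=0):
    """Hypothetical helper: draw an independent random 0/1 mask per frame."""
    rng = random.Random(seed)
    return [
        [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
        for _ in range(num_frames)
    ]


# Toy example: 4 frames of 2x3 pixels; frame t has constant intensity t + 1.
T, H, W = 4, 2, 3
frames = [[[float(t + 1)] * W for _ in range(H)] for t in range(T)]
masks = random_binary_masks(T, H, W)
snapshot = sci_forward(frames, masks)  # one H x W image encoding all T frames
```

Reconstruction has to invert this mapping, recovering T frames from a single H x W measurement given the masks, which is why the problem is ill-posed and why a network that handles the spatial and temporal dimensions separately is a natural fit.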
