Recently, deep learning models have achieved impressive success in solving the inverse problem of Snapshot Compressive Imaging (SCI) for video, \ie, reconstructing multiple high-fidelity frames from a single-shot observation. However, existing works lack insight into the mixed degradation of spatial masking and temporal aliasing, and empirically follow the designs of successful plain video restoration (\eg, denoising, deblurring) models, limiting the overall performance. In this work, we tailor a network architecture and a Hierarchical Separable Video Transformer (HiSViT) as its building block, composed of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and a Gated Self-Modulated Feed-Forward Network (GSM-FFN). CSS-MSA decomposes spatio-temporal similarity calculations into the spatial and temporal dimensions, yet attends to all spatio-temporal tokens at a controllable scale within a single attention layer. GSM-FFN is designed to bring locality to CSS-MSA via a gating mechanism and space-time separable convolutions. HiSViT is built from multiple groups of CSS-MSA plus GSM-FFN, each of which focuses on a different scale, enabling multi-scale interaction and long-range modeling. Extensive experiments demonstrate that our model achieves state-of-the-art performance.
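To make the separable-attention idea concrete, below is a minimal, hypothetical PyTorch sketch of a space-time separable self-attention layer: spatial (N-by-N) and temporal (T-by-T) similarities are computed separately, then composed on the values, so each output token draws on all T-by-N tokens within one layer while avoiding the quadratic cost of full spatio-temporal attention. The class name, module layout, and hyperparameters are illustrative assumptions; the abstract does not specify CSS-MSA's exact formulation (including its cross-scale grouping), so this sketches the general principle only, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossDimSeparableAttention(nn.Module):
    """Hypothetical space-time separable MSA sketch (not the paper's code).

    Similarities are computed separately along the spatial and temporal
    dimensions, but both are applied to the values inside one layer, so
    every output token can aggregate information from all T*N tokens.
    """

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, N, C) -- batch, frames, spatial tokens, channels
        B, T, N, C = x.shape
        qkv = self.qkv(x).reshape(B, T, N, 3, self.h, C // self.h)
        q, k, v = qkv.permute(3, 0, 4, 1, 2, 5)  # each: (B, h, T, N, d)

        # Spatial similarities within each frame: (B, h, T, N, N)
        attn_s = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)

        # Temporal similarities at each spatial location: (B, h, N, T, T)
        qt, kt = q.transpose(2, 3), k.transpose(2, 3)
        attn_t = (qt @ kt.transpose(-2, -1) * self.scale).softmax(dim=-1)

        # Temporal mixing of values, then spatial mixing, in one layer.
        vt = (attn_t @ v.transpose(2, 3)).transpose(2, 3)  # (B, h, T, N, d)
        out = attn_s @ vt                                  # (B, h, T, N, d)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, T, N, C)
        return self.proj(out)


# Usage: 8 frames of 16x16 = 256 tokens with 64 channels per token.
x = torch.randn(2, 8, 256, 64)
y = CrossDimSeparableAttention(dim=64)(x)
print(y.shape)  # torch.Size([2, 8, 256, 64])
```

The design point this illustrates: the effective attention map is a composition of a T-by-T and an N-by-N matrix, so similarity computation scales with T^2 + N^2 rather than (TN)^2, while retaining global spatio-temporal reach in a single layer.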