

Poster

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Cong Wu · Xiao-Jun Wu · Linze Li · Tianyang Xu · Zhenhua Feng · Josef Kittler

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

The integration of CLIP (Contrastive Vision-Language Pre-training) has significantly refreshed the accuracy leaderboard of FSAR (Few-shot Action Recognition). However, the trainable overhead required to align the domains of CLIP and FSAR is often unbearable. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism is proposed for domain adaptation, which avoids most gradient backpropagation and thus improves efficiency; meanwhile, a multi-level representation is utilised during the reasoning and matching processes to improve discriminability, ensuring effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion of cached multi-stage features extracted by CLIP. After that, the current features are decoupled into multi-level representations, including global-level, patch-level, and frame-level. The ensuing spatiotemporal reasoning module operates on these multi-level representations to generate discriminative features. As for matching, multi-level contrasts between text-visual and support-query pairs are integrated to provide comprehensive guidance. The experimental results demonstrate that EMP-Net can unlock the potential performance of CLIP in a more efficient manner. Please find our code in the supplementary materials.
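To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of how cached multi-stage CLIP features might be skip-fused and then decoupled into global-, patch-, and frame-level representations, as the abstract outlines. All module names, tensor shapes, and the weighted-fusion scheme are illustrative assumptions.

```python
# Hypothetical sketch: skip-fusion of cached CLIP stage features followed by
# decoupling into global-, patch-, and frame-level representations.
# Shapes and the fusion scheme are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class MultiLevelDecoupler(nn.Module):
    """Fuse cached CLIP stage features, then split them into
    global-, patch-, and frame-level representations."""

    def __init__(self, dim: int, num_stages: int = 3):
        super().__init__()
        # Learnable weights for the skip-fusion over cached stages;
        # the frozen CLIP backbone itself receives no gradients.
        self.stage_weights = nn.Parameter(torch.ones(num_stages) / num_stages)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cached_stages: list) -> tuple:
        # cached_stages: list of [B, T, P, D] tensors from frozen CLIP stages
        # (B videos, T frames, P tokens including [CLS], D channels).
        w = torch.softmax(self.stage_weights, dim=0)
        fused = sum(wi * feat for wi, feat in zip(w, cached_stages))
        fused = self.proj(fused)                    # [B, T, P, D]

        frame_level = fused[:, :, 0]                # per-frame [CLS] token -> [B, T, D]
        patch_level = fused[:, :, 1:]               # spatial tokens        -> [B, T, P-1, D]
        global_level = frame_level.mean(dim=1)      # video-level summary   -> [B, D]
        return global_level, patch_level, frame_level


if __name__ == "__main__":
    B, T, P, D = 2, 8, 50, 512
    stages = [torch.randn(B, T, P, D) for _ in range(3)]  # cached features, no backprop to CLIP
    g, p, f = MultiLevelDecoupler(D)(stages)
    print(g.shape, p.shape, f.shape)  # torch.Size([2, 512]) torch.Size([2, 8, 49, 512]) torch.Size([2, 8, 512])
```

In this reading, only the lightweight post-reasoning modules are trained on top of cached features, which is what avoids most gradient backpropagation through the CLIP backbone; the subsequent spatiotemporal reasoning and text-visual / support-query matching would operate on the three returned representations.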
