Poster
Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning
Cong Wu · Xiao-Jun Wu · Linze Li · Tianyang Xu · Zhenhua Feng · Josef Kittler
# 143
Strong Double Blind |
The integration with CLIP (Contrastive Vision-Language Pre-training) has significantly refreshed the accuracy leaderboard of FSAR (Few-shot Action Recognition). However, the trainable overhead of ensuring that the domain alignment of CLIP and FSAR is often unbearable. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism is been proposed for domain adaptation, which avoids most gradient backpropagation, improving the efficiency; meanwhile, a multi-level representation is utilised during the reasoning and matching processes to improve the discriminability, ensuring effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion involving cached multi-stage features extracted by CLIP. After that, current feature are decoupled into multi-level representations, including global-level, patch-level, and frame-level. The ensuing spatiotemporal reasoning module operates on multi-level representations to generate discriminative features. As for matching, the multi-level contrasts between text-visual and support-query are integrated to provide a comprehensive guidance. The experimental results demonstrate that EMP-Net can unlock the potential performance of CLIP in a more efficient manner. Please find our code in supplementary materials.
Live content is unavailable. Log in and register to view live content