

Poster

Efficient Vision Transformers with Partial Attention

Xuan-Thuy Vo · Duy-Linh Nguyen · Adri Priadana · Kang-Hyun Jo

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

As the core of the Vision Transformer (ViT), self-attention offers high flexibility in modeling long-range dependencies because every query attends to all spatial locations. Although ViT achieves promising performance on visual tasks, the complexity of self-attention is quadratic in the token length. This poses challenges when transferring ViT models to dense prediction tasks that require high input resolutions. Previous works have tried to solve this problem by introducing sparse attention, such as spatial reduction attention and window attention. A common trait of these methods is that all image/window tokens are involved in computing the attention weights. In this paper, we find that attention weights exhibit high similarity, which incurs computational redundancy. To address this issue, we propose a novel attention mechanism, called partial attention, that learns spatial interactions more efficiently by reducing redundant information in attention maps. Each query in our attention interacts only with a small set of relevant tokens. Based on partial attention, we design an efficient and general vision Transformer, named PartialFormer, that attains good trade-offs between accuracy and computational cost across vision tasks. For example, on ImageNet-1K, PartialFormer-B3 outperforms Swin-T by 1.7% Top-1 accuracy while saving 25% of GFLOPs, and Focal-T by 0.8% while saving 30% of GFLOPs.
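For intuition only, below is a minimal PyTorch sketch of one way to realize "each query interacts with a small set of relevant tokens", using per-query top-k key selection. The selection rule is an assumption for illustration, not the paper's actual mechanism, and this sketch still forms the full score matrix, so it does not reflect the reported GFLOP savings.

# Minimal sketch of partial attention via per-query top-k key selection.
# NOTE: the top-k rule is an assumed stand-in for the paper's token-selection scheme.
import torch
import torch.nn.functional as F

def partial_attention(q, k, v, top_k=16):
    """q, k, v: (batch, heads, tokens, dim). Each query keeps only its
    top_k highest-scoring keys and masks out the rest before softmax."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale        # (B, H, N, N)
    # Keep the top-k scores per query; mask the remaining (redundant) ones.
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_scores)
    attn = F.softmax(masked, dim=-1)                  # zero weight outside the top-k set
    return attn @ v                                   # (B, H, N, dim)

# Example: 196 tokens (a 14x14 feature map), 4 heads, 64-dim heads.
q = k = v = torch.randn(2, 4, 196, 64)
out = partial_attention(q, k, v, top_k=16)
print(out.shape)  # torch.Size([2, 4, 196, 64])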
