Poster

Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining

Diwei Su · cheng fei · Jianxu Luo

Strong blind review: This paper was not made available on public preprint services during the review process

Strong Double Blind

2024 Poster

Project Page Paper PDF [ Poster] [ Supplemental]

Abstract

In recent years, vision transformers based on self-attention mechanisms have demonstrated remarkable abilities in various tasks such as natural language processing, computer vision (CV), and multimodal applications. However, due to the high computational costs and the structural nature of images, Transformers in CV face challenges in handling ultra-high-resolution images. Recently, several token reduction methods have been proposed to improve the computational efficiency of Transformers by reducing the number of tokens without the need for retraining. These methods primarily involve fusion based on matching or clustering. The former exhibits faster speed but suffers more accuracy loss compared to the latter. In this work, we propose a simple matching-based fusion method called Token Adapter, which achieves comparable accuracy to the clustering-based fusion method with faster speed and demonstrates higher potential in terms of robustness. Our method was applied to Segmenter and MaskDINO, exhibiting promising performance on three tasks, including semantic segmentation, instance segmentation, and panoptic segmentation. Specifically, our method can be applied to Segmenter on ADE20k, providing 41% frames per second (FPS) acceleration while maintaining full performance without retraining or fine-tuning off-the-shelf weights.

Chat is not available.