Thanks to the excellent generalization capability of pre-trained Vision-Language Models (VLMs) such as CLIP, fine-tuning VLMs for downstream tasks (e.g., zero-shot generalization) has become a popular choice. Despite achieving promising performance on base classes, most existing fine-tuning methods suffer from feature confusion on novel classes, resulting in unsatisfactory transferability. To address this problem, we propose a divide-and-conquer approach called Prompt-based Variational Adapter (PVA), which explicitly reduces the prediction bias by separating base and novel samples. Specifically, we design two variational adapters with learnable textual tokens to align the latent representations of each modality in a shared latent space. Once trained, we can separate novel samples from the entangled space using a similarity metric over latent features, i.e., converting the confusion task into two independent ones (one for base classes and the other for novel classes). Moreover, to improve transferability to novel classes, we further refine the output features of the learned adapters with the global features via a residual connection. To the best of our knowledge, this is the first framework that combines prompt learning and adapter tuning to tackle the feature confusion issue. We conduct extensive experiments on GZSL and Cross-Dataset Transfer Learning to demonstrate the superiority of our approach and establish a new state of the art on four popular benchmarks.
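As a minimal sketch of the components described above, the PyTorch snippet below illustrates (i) a variational adapter that maps a frozen backbone feature into a shared latent space, (ii) refinement of the global feature via a residual connection, and (iii) a similarity-based split of samples into base and novel branches. All module names, dimensions, the mixing weight, and the similarity threshold are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalAdapter(nn.Module):
    """Maps a frozen backbone feature into a shared latent space via a
    Gaussian posterior (reparameterization trick), then projects it back.
    Dimensions are assumed, e.g. 512-d CLIP features."""

    def __init__(self, feat_dim=512, latent_dim=128):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):
        mu, logvar = self.to_mu(x), self.to_logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample latent
        return self.decode(z), mu, logvar


def refine_with_residual(global_feat, adapter, alpha=0.2):
    """Fuse the adapter output with the original (global) feature via a
    residual connection; alpha is an assumed mixing weight."""
    adapted, mu, logvar = adapter(global_feat)
    refined = F.normalize(global_feat + alpha * adapted, dim=-1)
    return refined, mu, logvar


def split_base_novel(image_latents, base_prototypes, threshold=0.5):
    """Route each sample to the 'base' or 'novel' branch by its maximum
    cosine similarity to base-class prototypes in the shared latent space.
    The threshold value is an assumption for illustration."""
    sims = F.normalize(image_latents, dim=-1) @ F.normalize(base_prototypes, dim=-1).t()
    is_base = sims.max(dim=-1).values > threshold
    return is_base  # boolean mask: True -> base branch, False -> novel branch


if __name__ == "__main__":
    feats = torch.randn(8, 512)      # stand-in for frozen CLIP image features
    protos = torch.randn(100, 128)   # stand-in for latent base-class prototypes
    adapter = VariationalAdapter()
    refined, mu, _ = refine_with_residual(feats, adapter)
    mask = split_base_novel(mu, protos)
    print(refined.shape, mask)
```

Once a sample is routed by the mask, each branch can be scored independently (base classifier for base-flagged samples, novel classifier otherwise), which is the divide-and-conquer step the abstract refers to.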