ECCV Poster PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Poster

Junsong Chen · Chongjian GE · Enze Xie · Yue Wu · Lewei Yao · Xiaozhe Ren · Zhongdao Wang · Ping Luo · Huchuan Lu · ZHENGUO LI

2024 Poster

Abstract:

In this paper, we introduce PixArt-Sigma, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Sigma represents a significant advancement over its predecessor, PixArt-Alpha, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Sigma is its training efficiency. Leveraging the foundational pre-training of PixArt-Alpha, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Sigma are twofold: (1) High-Quality Training Data: PixArt-Sigma incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Sigma achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Sigma's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
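The token-compression idea in point (2) can be illustrated with a minimal sketch. The abstract only states that keys and values are compressed; it does not say how. The block below assumes, purely for illustration, that compression is done by averaging each r x r block of the token grid, and all function names (`avg_pool_tokens`, `kv_compressed_attention`) are hypothetical rather than taken from the PixArt-Sigma code. Queries keep every token, so the output length is unchanged while the attention score matrix shrinks by a factor of r².

```python
import math


def avg_pool_tokens(tokens, h, w, r):
    """Average each r x r block of an h x w token grid.

    `tokens` is a row-major list of h*w feature vectors.
    Averaging is an assumed stand-in for the paper's compression operator.
    """
    assert h % r == 0 and w % r == 0, "grid must be divisible by r"
    d = len(tokens[0])
    pooled = []
    for by in range(0, h, r):
        for bx in range(0, w, r):
            acc = [0.0] * d
            for y in range(by, by + r):
                for x in range(bx, bx + r):
                    vec = tokens[y * w + x]
                    for i in range(d):
                        acc[i] += vec[i]
            pooled.append([a / (r * r) for a in acc])
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def kv_compressed_attention(q, k, v, h, w, r):
    """Single-head attention in which only K and V are pooled.

    Returns the attended outputs (one per query token) and the
    number of compressed key/value tokens.
    """
    k_c = avg_pool_tokens(k, h, w, r)
    v_c = avg_pool_tokens(v, h, w, r)
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v_c[0])
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_c]
        weights = softmax(scores)
        out.append(
            [sum(wt * vj[i] for wt, vj in zip(weights, v_c)) for i in range(d_v)]
        )
    return out, len(k_c)


# Toy 4 x 4 grid of 2-dimensional tokens, compressed with r = 2.
h, w, r = 4, 4, 2
tokens = [[float(y), float(x)] for y in range(h) for x in range(w)]
out, n_kv = kv_compressed_attention(tokens, tokens, tokens, h, w, r)
print(len(out), n_kv)  # 16 query tokens attend to 4 compressed key/value tokens
```

In this toy case each query scores 4 keys instead of 16; the same r² saving on the score matrix is what makes very long token sequences, such as those of a 4K image, more tractable.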
