

Poster

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zhiyu Tan · Mengping Yang · Luozheng Qin · Hao Yang · Ye Qian · Qiang Zhou · Cheng Zhang · Hao Li

# 227
Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: Recent years have witnessed remarkable progress in Text-to-Image (T2I) generation, driven by advances in diffusion models. A critical prerequisite for faithful image creation is a precise understanding of the input text condition, for which most existing models borrow the CLIP text encoder. Despite its widespread use in T2I generation, the CLIP text encoder has several drawbacks: it can only encode English, and its maximum token length is very limited, i.e., only 77 tokens. Additionally, CLIP is a relatively small model, which restricts its language ability. To address these limitations, this paper leverages Large Language Models (LLMs) as the text encoder to improve the language understanding of T2I diffusion models. Text features from LLMs usually support multiple languages, accommodate longer contexts, and provide superior text expression ability, thereby improving synthesis quality. Because training T2I diffusion models from scratch on LLM textual features requires massive computational resources and a considerable amount of text-image data, we develop an innovative three-stage training strategy that effectively and efficiently integrates an existing text-to-image model with a large language model. The key ingredient of our model is a lightweight adapter that enables fast training of T2I diffusion models on LLM textual representations while preserving the language power of LLMs. Extensive experimental results confirm that our model not only supports multiple languages but also achieves superior image generation performance, as evidenced by both automatic metrics and human evaluation. Our code and models will be released.
