Poster

Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions

Yijun Qian · Jack Urbanek · Alexander Hauptmann · Jungdam Won

Strong blind review: This paper was not made available on public preprint services during the review process

Strong Double Blind

2024 Poster

Paper PDF

Abstract

The field of 3D human motion generation from natural language descriptions, known as Text2Motion, has gained significant attention for its potential application in industries such as film, gaming, and AR/VR. To tackle a key challenge in Text2Motion, the deficiency of 3D human motions and their corresponding textual descriptions, we built a novel large-scale 3D human motion dataset, LaViMo, extracted from in-the-wild web videos and action recognition datasets. LaViMo is approximately 3.3 times larger and encompasses a much broader range of actions than the largest available 3D motion dataset. We then introduce a novel multi-task framework TMT (Text Motion Translator), aimed at generating faithful 3D human motions from natural language descriptions, especially focusing on complicated actions and those not existing in the training set. In contrast to prior works, TMT is uniquely regularized by multiple tasks, including Text2Motion, Motion2Text, Text2Text, and Motion2Motion. This multi-task regularization significantly bolsters the model's robustness and enhances its ability of motion modeling and semantic understanding. Additionally, we devised an augmentation method for the textual descriptions using Large Language Models. This augmentation significantly enhances the model's capability to interpret open-vocabulary descriptions while generating motions. The results demonstrate substantial improvements over existing state-of-the-art methods, particularly in handling diverse and novel motion descriptions, laying a strong foundation for future research in the field.

Chat is not available.