Poster
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Kent Fujiwara · Mikihiro Tanaka · Qing Yu
# 134
Strong Double Blind
With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and text into features that achieve accurate correspondence between the two modalities. However, despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the temporal understanding of these models. We decompose textual descriptions into events and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task in which a motion-language model must retrieve the more likely text between the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance under conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose using these texts with shuffled event sequences as negative samples to reinforce motion-language models. Experiments on text-motion retrieval and text-to-motion generation with the reinforced models demonstrate improved performance over conventional approaches, indicating the necessity of considering the temporal element in motion-language alignment.
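To make the CAR protocol concrete, the Python sketch below illustrates the two ingredients described in the abstract: decomposing a compound description into events and shuffling them to form a chronologically inaccurate negative, then checking whether a model prefers the ground truth. This is a minimal illustration, not the authors' implementation: the event decomposition here is a crude split on sequencing conjunctions, and `model`, `encode_motion`, and `encode_text` are a hypothetical interface assumed to return unit-norm embedding vectors.

```python
import random
import re

def decompose_into_events(description):
    """Naively split a compound description into its constituent events.

    The paper decomposes descriptions into events; splitting on common
    sequencing conjunctions is a simplification for illustration only.
    """
    parts = re.split(r",?\s*\b(?:and then|then|and)\b\s+", description)
    return [p.strip(" .,") for p in parts if p.strip(" .,")]

def make_shuffled_negative(description, rng):
    """Return the description with its event order perturbed, or None
    when there is no chronology to shuffle."""
    events = decompose_into_events(description)
    if len(events) < 2:
        return None
    if len(set(events)) < 2:       # identical events: shuffling is a no-op
        return None
    shuffled = events[:]
    while shuffled == events:      # guarantee the order actually changes
        rng.shuffle(shuffled)
    return ", then ".join(shuffled)

def car_accuracy(model, motion_text_pairs, seed=0):
    """Chronologically Accurate Retrieval: the fraction of motions for
    which the model scores the ground-truth text above its shuffled
    counterpart. `model` is a hypothetical interface assumed to expose
    encode_motion / encode_text returning unit-norm embeddings, so the
    dot product serves as the similarity score."""
    rng = random.Random(seed)
    correct = total = 0
    for motion, text in motion_text_pairs:
        negative = make_shuffled_negative(text, rng)
        if negative is None:       # single-event description: skip
            continue
        m = model.encode_motion(motion)
        pos = float(m @ model.encode_text(text))
        neg = float(m @ model.encode_text(negative))
        correct += pos > neg
        total += 1
    return correct / max(total, 1)
```

The same shuffled texts can then be fed back as hard negatives in a contrastive training objective, which is how the abstract proposes to reinforce motion-language models toward better temporal alignment.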