LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

ICLR 2025

Zhe Li1,2*, Weihao Yuan2*, Yisheng He2, Lingteng Qiu2, Shenhao Zhu2,3, Xiaodong Gu2, Weichao Shen2, Yuan Dong2, Zilong Dong2†, Laurence T. Yang1

1Huazhong University of Science and Technology

2Alibaba Group

3Nanjing University
* Equal Contribution     † Corresponding Author

Abstract

Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion because CLIP is pretrained on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model that transitions from a language-vision latent space to a more suitable language-motion latent space. LaMP addresses these limitations by producing motion-informative text embeddings, significantly improving the relevance and semantics of the generated motion sequences. With LaMP, we advance three key tasks through aligned language-motion representation learning: text-to-motion generation, motion-text retrieval, and motion captioning. For generation, LaMP provides the text condition in place of CLIP, and an autoregressive masked-prediction scheme is designed to achieve masked modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model on the language-informative motion features to obtain a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess how well generated motions align with their textual descriptions. Extensive experiments on multiple datasets demonstrate substantial improvements over previous methods across all three tasks.

Methodology

LaMP overview. LaMP is jointly trained with contrastive learning, text-motion matching, and bidirectional text-motion translation objectives, leveraging the textual features extracted from tokenized text descriptions by the text transformer and the motion features produced by the motion transformer.
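Below is a minimal PyTorch sketch of the kind of joint pretraining objective described above, combining a symmetric text-motion contrastive loss with a binary matching loss. All names (`contrastive_loss`, `match_head`, the feature dimensions) are illustrative assumptions rather than the released LaMP implementation, and the bidirectional text-motion translation loss is omitted for brevity.

```python
# Sketch of LaMP-style joint pretraining losses, assuming the text and motion
# transformers each emit one pooled feature per sample. All names and shapes
# below are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(text_feats, motion_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/motion features."""
    text_feats = F.normalize(text_feats, dim=-1)
    motion_feats = F.normalize(motion_feats, dim=-1)
    logits = text_feats @ motion_feats.t() / temperature      # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def matching_loss(match_head, fused_feats, labels):
    """Binary text-motion matching on fused cross-modal features."""
    scores = match_head(fused_feats).squeeze(-1)
    return F.binary_cross_entropy_with_logits(scores, labels.float())

# One illustrative training step with random placeholder features.
B, D = 32, 512
text_feats, motion_feats = torch.randn(B, D), torch.randn(B, D)
fused_feats = torch.randn(B, D)                 # stand-in for cross-attended fusion
labels = torch.randint(0, 2, (B,))              # 1 = matched pair, 0 = mismatched
match_head = torch.nn.Linear(D, 1)
loss = contrastive_loss(text_feats, motion_feats) + matching_loss(match_head, fused_feats, labels)
loss.backward()
```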


Overview of the LaMP-T2M and LaMP-M2T frameworks. (Left) The pretrained LaMP text transformer extracts the condition embedding, and autoregressive masked prediction is performed. (Right) An LLM is finetuned to perform motion captioning.
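The sketch below illustrates text-conditioned masked prediction over discrete motion tokens, the core idea behind the left panel, in simplified form: the paper's autoregressive masked-prediction scheme is reduced to a single random-masking step, and `MaskedMotionModel`, the vocabulary size, and all dimensions are hypothetical placeholders rather than the authors' architecture.

```python
# Simplified text-conditioned masked motion-token prediction, assuming motion is
# quantized into discrete tokens and LaMP's text transformer provides a single
# condition embedding. Hypothetical sketch, not the released architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMotionModel(nn.Module):
    def __init__(self, vocab_size=512, dim=256, n_layers=4, n_heads=4, max_len=196):
        super().__init__()
        self.mask_id = vocab_size                        # extra id used as [MASK]
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        self.cond_proj = nn.Linear(dim, dim)             # projects the text condition
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, text_cond):
        x = self.tok_emb(tokens) + self.pos_emb[:, : tokens.size(1)]
        x = x + self.cond_proj(text_cond).unsqueeze(1)   # broadcast condition over time
        return self.head(self.encoder(x))                # (B, T, vocab) logits

# One illustrative step: mask a random subset of motion tokens and predict them
# from the visible tokens plus the text condition.
B, T, V, D = 8, 64, 512, 256
model = MaskedMotionModel(vocab_size=V, dim=D, max_len=T)
motion_tokens = torch.randint(0, V, (B, T))
text_cond = torch.randn(B, D)                            # stand-in for a LaMP text feature
mask = torch.rand(B, T) < 0.4                            # random masking ratio
inputs = motion_tokens.masked_fill(mask, model.mask_id)
logits = model(inputs, text_cond)
loss = F.cross_entropy(logits[mask], motion_tokens[mask])
loss.backward()
```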

Experiments

Quantitative Results:



Qualitative Results:

BibTeX

@article{li2025lamp,
    title={LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning},
    author={Zhe Li and Weihao Yuan and Yisheng He and Lingteng Qiu and Shenhao Zhu and Xiaodong Gu and Weichao Shen and Yuan Dong and Zilong Dong and Laurence T. Yang},
    journal={International Conference on Learning Representations (ICLR)},
    year={2025},
}