Generating Human Motion in 3D Scenes from Text Descriptions
Generates human motions in 3D indoor scenes from text descriptions by decomposing the task into language grounding of target objects and object-centric motion generation.
The paper proposes a novel approach for generating human motions in 3D indoor scenes from text descriptions. The key idea is to decompose the problem into two sub-problems: language grounding of objects in scenes and object-centric motion generation. To ground objects in scenes, the authors leverage large language models (LLMs) by formulating the task as question answering. They construct scene graphs from 3D scenes and use ChatGPT to analyze the relationship between the scene description and the input instruction, identifying the target object accurately.
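A minimal sketch of how such question-answering-style grounding might be set up is shown below. The scene-graph schema, prompt wording, and example objects are illustrative assumptions, not the authors' exact implementation; the resulting prompt would be sent to ChatGPT, whose answer names the target object.

```python
# Hypothetical sketch: grounding a target object via LLM question answering.
# The scene-graph format and prompt template are assumptions for illustration.

def scene_graph_to_text(scene_graph):
    """Serialize a scene graph (object nodes + spatial relations) into plain text."""
    lines = [f"- {n['id']}: a {n['label']}" for n in scene_graph["nodes"]]
    lines += [f"- {s} is {rel} {o}" for s, rel, o in scene_graph["relations"]]
    return "\n".join(lines)

def build_grounding_prompt(scene_graph, instruction):
    """Formulate object grounding as a question-answering task for the LLM."""
    return (
        "The following objects and spatial relations describe a 3D indoor scene:\n"
        f"{scene_graph_to_text(scene_graph)}\n\n"
        f'Instruction: "{instruction}"\n'
        "Question: which object id does the instruction refer to? "
        "Answer with the object id only."
    )

if __name__ == "__main__":
    scene_graph = {
        "nodes": [
            {"id": "chair_1", "label": "chair"},
            {"id": "table_2", "label": "table"},
            {"id": "sofa_3", "label": "sofa"},
        ],
        "relations": [
            ("chair_1", "next to", "table_2"),
            ("sofa_3", "facing", "table_2"),
        ],
    }
    prompt = build_grounding_prompt(scene_graph, "sit down on the sofa facing the table")
    print(prompt)  # This prompt would be sent to ChatGPT to obtain the target id.
```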
For object-centric motion generation, the authors design an object-centric representation that focuses on the target object. They convert the point cloud around the target object into a volumetric representation and employ diffusion models to synthesize human motions conditioned on this representation and the text description. The object-centric representation reduces scene complexity, making it easier to model the relationship between human motions and objects. The full method runs in two stages: it first localizes the target object and then generates motion conditioned on it.
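The sketch below illustrates one plausible form of such an object-centric volumetric representation: voxelizing scene points inside a local box around the target object into a binary occupancy grid. The grid resolution, box extent, and centering convention are assumptions, not the paper's exact design.

```python
# Hypothetical sketch: an object-centric volumetric representation built by
# voxelizing scene points in a local cube around the target object.
# Grid resolution, cube extent, and centering are illustrative assumptions.
import numpy as np

def object_centric_occupancy(scene_points, object_center, extent=2.0, res=32):
    """Voxelize scene points within a cube of side `extent` (meters),
    centered on the target object, into a res^3 binary occupancy grid."""
    local = scene_points - object_center            # shift to object-centric frame
    half = extent / 2.0
    inside = np.all(np.abs(local) < half, axis=1)   # keep points near the object
    idx = ((local[inside] + half) / extent * res).astype(int)
    idx = np.clip(idx, 0, res - 1)                  # guard against boundary points
    grid = np.zeros((res, res, res), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0     # mark occupied voxels
    return grid

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.uniform(-3, 3, size=(10_000, 3))   # stand-in for a scene point cloud
    grid = object_centric_occupancy(points, object_center=np.array([0.5, 0.0, 0.4]))
    print(grid.shape, int(grid.sum()))              # (32, 32, 32) and occupied-voxel count
```

In a pipeline like the paper's, a grid of this kind, together with the text embedding, would condition the diffusion model that synthesizes the motion sequence.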