GMD: Guided Motion Diffusion Models

Abstract

Denoising diffusion models have shown great promise in human motion synthesis conditioned on natural language descriptions. However, it remains a challenge to integrate spatial constraints, such as pre-defined motion trajectories and obstacles, which is essential for bridging the gap between isolated human motion and its surrounding environment.

To address this issue, we propose Guided Motion Diffusion (GMD), a method that incorporates spatial constraints into the motion generation process. Specifically, we propose an effective feature projection scheme that largely enhances the coherency between spatial information and local poses. Together with a new imputation formulation, the generated motion can reliably conform to spatial constraints such as global motion trajectories.

Furthermore, given sparse spatial constraints (e.g. sparse keyframes), we introduce a new dense guidance approach that utilizes the denoiser of diffusion models to turn a sparse signal into denser signals, effectively guiding the generated motion to the given constraints.

Our extensive experiments justify the development of GMD, which achieves a significant improvement over state-of-the-art methods in text-based motion generation while being able to control the synthesized motions with spatial constraints.

Result Video

Summary

Problems

To make the synthesized motions useful, we need a way to ground them in the 3D world. However, existing works synthesize motion from text alone, with no control over where the character is heading.
Two main issues make a motion diffusion model likely to ignore a given spatial objective: the sparseness of the global location in the motion representation and the sparseness of the guidance signals.
Sparseness means the guidance will likely be ignored during the backward step. Why? Because it is easier to change a few values to make them coherent with the rest than the other way around.
Sparseness in the representation: in each frame, the global location accounts for only 4 of the 263 values in the motion representation. Because the global location carries so little weight, the model is more likely to change it to match the local pose than vice versa.
Sparse guidance signals: the spatial guidance is itself sparse, as it is defined on only a few keyframes.

Our Solutions

Motion-specific, U-Net-based models that noticeably improve the base text-to-motion performance.
Emphasis projection allows us to manipulate feature importance before training. We give more emphasis to the global locations, which are underrepresented in the motion vectors, making it harder for the model to ignore them.
Dense gradient propagation exploits the nature of denoising functions to propagate the gradient at a specific frame to its neighboring frames, making the guidance gradient dense along the frame dimension.

Our Pipeline

In GMD, we tackle the problem of spatially conditioned motion generation using a two-stage pipeline. (a) The optional first stage generates a trajectory given spatial conditioning; the second stage then synthesizes motion according to that trajectory. Our main contributions are (b) emphasis projection, for better trajectory-motion coherence, and (c) dense signal propagation, for more controllable generation even under sparse guidance signals.

Emphasis projection

The most straightforward way of forcing the model to put more importance on the trajectory is to scale up its loss weight relative to the other parts during training. However, we found that loss manipulation is ineffective in this setting and also makes training less stable. To achieve the same effect, we can instead scale up the trajectory values in the motion representation (multiply them by c). Doing so, however, means the values in the motion vector no longer share the same variance, which most diffusion models assume. Emphasis projection is the next most straightforward way to adjust trajectory importance while taking these problems into account. It can be described as follows:

Scale up the trajectory values by a factor c.
Randomly project the motion vector into a new vector using a fixed random matrix. Use the same matrix for every frame.
Normalize the new motion vector so that it has unit variance again.

The final vectors are used as motion representation instead of the original representation.
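
The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of indices 0-3 for the trajectory values, the emphasis factor c = 10, and the use of a single scalar normalizer (which restores unit variance on average, assuming the input features are independent with unit variance) are all hypothetical.

```python
import numpy as np


def build_emphasis_matrix(dim, traj_idx, c, seed=0):
    """Return a fixed (dim, dim) matrix combining steps 1-3 (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Step 1: per-feature scale, c for trajectory values and 1 elsewhere.
    scale = np.ones(dim)
    scale[traj_idx] = c

    # Step 2: fixed random matrix; A * scale equals A @ diag(scale).
    A = rng.standard_normal((dim, dim))
    P = A * scale

    # Step 3: scalar normalizer (an assumption of this sketch). For independent
    # unit-variance inputs, each output has expected variance sum(scale**2).
    P /= np.sqrt(np.sum(scale ** 2))
    return P


def project(motion, P):
    """Apply the same projection to every frame; motion has shape (frames, dim)."""
    return motion @ P.T


def unproject(projected, P):
    """Invert the projection to recover the original representation."""
    return np.linalg.solve(P, projected.T).T
```

Because the matrix in this sketch is square and fixed, the projection can be inverted after sampling. Under the assumptions above, the 4 trajectory values would account for roughly 400/659, or about 61%, of the projected variance in expectation, compared with 4/263 before emphasis.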

Dense signal propagation

Our goal is to convert the sparse guidance signals defined on a few keyframes into denser signals during denoising. The spatial guidance can only be defined on the clean motion (X₀), but guiding the motion at denoising step t requires gradients w.r.t. the noisy motion (X_t). We observe that:

We need a function that can propagate gradients defined on some frames of X₀ to their surrounding frames, but with respect to X_t (not X₀!).
By definition, any denoising function (f: x_noisy → x_clean) that produces a clean motion from noisy motion has this dense gradient propagation property because it needs to look at the context to refine any given frame (let x_noisy=x_t and x_clean=x₀).
This coincides with the definition of the diffusion model x_0,θ which predicts the clean motion X₀ from the noisy motion X_t.

As such, we can use the diffusion model itself to efficiently compute the gradients. In practice, this translates to using autodiff to compute the guidance gradients w.r.t. the input of the diffusion model.
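
To make the argument concrete, the sketch below replaces the trained diffusion model with a hypothetical linear denoiser (a moving average over neighboring frames), so the chain rule through it can be written by hand as a multiplication with the transposed Jacobian. With a real network, an autodiff framework would compute the same vector-Jacobian product. All names, the window radius, and the squared-distance keyframe loss are assumptions made for illustration.

```python
import numpy as np


def toy_denoiser_matrix(n_frames, radius):
    """Linear stand-in for the denoiser: each clean frame is the mean of the
    noisy frames within `radius` of it (hypothetical, not a trained model)."""
    K = np.zeros((n_frames, n_frames))
    for i in range(n_frames):
        lo, hi = max(0, i - radius), min(n_frames, i + radius + 1)
        K[i, lo:hi] = 1.0 / (hi - lo)
    return K


def guidance_gradients(x_t, K, keyframes, targets):
    """Return the guidance gradient w.r.t. the clean and the noisy trajectory.

    x_t:       (n_frames, 2) noisy trajectory
    keyframes: list of frame indices that carry a target location
    targets:   (len(keyframes), 2) target locations
    """
    x0_hat = K @ x_t  # predicted clean trajectory

    # Loss = 0.5 * sum over keyframes of ||x0_hat[k] - target[k]||^2.
    # Its gradient w.r.t. x0_hat is nonzero only at the keyframes (sparse).
    grad_x0 = np.zeros_like(x0_hat)
    grad_x0[keyframes] = x0_hat[keyframes] - targets

    # Chain rule through the denoiser: the gradient w.r.t. x_t spreads to
    # every noisy frame the denoiser looked at (dense around each keyframe).
    grad_xt = K.T @ grad_x0
    return grad_x0, grad_xt
```

In this toy setting with 40 frames, a radius of 3, and keyframes 10 and 30, the loss touches 2 frames of the clean trajectory, while the gradient w.r.t. the noisy trajectory is nonzero on 14 frames (the 7-frame window around each keyframe).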

Generated motion trajectories, conditioned on target locations at given keyframes. Without dense signal propagation, the model ignores the target conditions.

BibTeX

@inproceedings{karunratanakul2023gmd,
  title={Guided Motion Diffusion for Controllable Human Motion Synthesis},
  author={Karunratanakul, Korrawe and Preechakul, Konpat and Suwajanakorn, Supasorn and Tang, Siyu},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={2151--2162},
  year={2023}
}

GMD: Guided Motion Diffusion for Controllable Human Motion Synthesis

ICCV 2023