GMD: Guided Motion Diffusion for Controllable Human Motion Synthesis

1Department of Computer Science, ETH Zurich
2VISTEC, Thailand

ICCV 2023

Teaser image.

The Guided Motion Diffusion (GMD) model can synthesize realistic human motion according to a text prompt, a reference trajectory, and key locations, while also avoiding stubbing a toe on the giant X-mark circles that someone dropped on the floor. No need to retrain the diffusion model for each of these tasks!


Denoising diffusion models have shown great promise in human motion synthesis conditioned on natural language descriptions. However, it remains a challenge to integrate spatial constraints, such as pre-defined motion trajectories and obstacles, which is essential for bridging the gap between isolated human motion and its surrounding environment.

To address this issue, we propose Guided Motion Diffusion (GMD), a method that incorporates spatial constraints into the motion generation process. Specifically, we propose an effective feature projection scheme that greatly enhances the coherency between spatial information and local poses. Together with a new imputation formulation, the generated motion can reliably conform to spatial constraints such as global motion trajectories.

Furthermore, given sparse spatial constraints (e.g., sparse keyframes), we introduce a new dense guidance approach that utilizes the denoiser of diffusion models to turn a sparse signal into a denser one, effectively guiding the generated motion toward the given constraints.

Extensive experiments justify the development of GMD, which achieves a significant improvement over state-of-the-art methods in text-based motion generation while being able to control the synthesized motions with spatial constraints.

Result Video



  • To make synthesized motions useful, we need a way to ground them in the 3D world. However, existing works synthesize motion from text alone, without control over where the character is heading.
  • Two main issues make a motion diffusion model likely to ignore a given spatial objective: the sparseness of the global location in the motion representation and the sparseness of the guidance signals.
  • Sparse guidance signals are likely to be ignored during the backward step. Why? Because it is easier to change a few values to make them coherent with the rest than the other way around.
  • Sparseness in the representation: in each frame, the global location occupies only 4 of the 263 values in the motion representation. Due to this small share, the model is more likely to change the global location to match the local pose than vice versa.
  • Sparse guidance signals: additionally, the spatial guidance itself is sparse, as it is only defined on a few keyframes.
Guidance problem.
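To get a feel for the representation imbalance, here is a minimal numpy sketch. It assumes a HumanML3D-style 263-D per-frame vector in which 4 values carry the global root trajectory; the index positions used below are illustrative, not the dataset's exact layout.

```python
import numpy as np

# In a HumanML3D-style 263-D frame vector, only 4 values carry the global
# root trajectory. Under a uniform per-dimension loss, the trajectory's
# contribution is tiny compared to the local pose.
rng = np.random.default_rng(0)
frame = rng.standard_normal(263)          # one normalized motion frame
dim_share = 4 / 263                       # fraction of dims that are trajectory
l2_share = (frame[:4] ** 2).sum() / (frame ** 2).sum()
print(f"trajectory dims: {dim_share:.1%}, typical L2 share: {l2_share:.1%}")
```

With both shares around 1-2%, it is cheaper for the model to bend the trajectory toward the local pose than the reverse.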

Our Solutions

  • Motion-specific, UNET-based models that noticeably improve the base text-to-motion performance.
  • Emphasis projection allows us to manipulate feature importance before training. We give more emphasis to the global locations, which are underrepresented in the motion vectors, making it harder for the model to ignore them.
  • Dense gradient propagation exploits the nature of denoising functions to propagate the gradient at a specific frame to its neighboring frames, making the guidance gradient dense in the frame dimension.
Guidance problem.

Our Pipeline

In GMD, we tackle the problem of spatially conditioned motion generation using a two-stage pipeline: (a) the optional first stage generates a trajectory given the spatial conditioning; the second stage then synthesizes motion according to the trajectory. Our main contributions are (b) emphasis projection, for better trajectory-motion coherence, and (c) dense signal propagation, for more controllable generation even under sparse guidance signals.

Method pipeline.

Emphasis projection

The most straightforward way to force the model to put more importance on the trajectory is to scale up its loss weight relative to the other parts during training. However, we found that loss manipulation is ineffective in this setting and also makes training less stable. To achieve the same effect, one could instead scale up the trajectory values in the motion representation (multiply them by c), but then the per-dimension variance of the motion vector is no longer uniform, which most diffusion models assume. Emphasis projection is the next most straightforward solution for adjusting trajectory importance while accounting for these problems. It can be described as follows:

  1. Scale up the trajectory values by a factor c.
  2. Randomly project the motion vector into a new vector using a fixed random matrix. Do the same for every frame.
  3. Normalize the new motion vector such that it has a unit variance again.
The final vectors are used as motion representation instead of the original representation.
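The three steps above can be sketched in numpy as follows. The emphasis factor `c`, the trajectory indices `traj_dims`, and the per-dimension normalization are illustrative choices, not the paper's exact implementation; the sketch assumes the input motion features are already normalized to unit variance and are roughly independent.

```python
import numpy as np

def emphasis_projection(motion, traj_dims, c=5.0, seed=0):
    """Project per-frame motion vectors so trajectory dims carry more weight.

    motion: (frames, d) array of unit-variance features.
    traj_dims: indices of the global-location values (e.g., 4 of 263).
    c: emphasis factor (illustrative value, tuned in practice).
    """
    d = motion.shape[1]
    emphasized = motion.copy()
    emphasized[:, traj_dims] *= c                 # step 1: scale up trajectory
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))               # step 2: fixed random matrix,
    projected = emphasized @ A.T                  #         same for every frame
    # step 3: rescale so each output dim has unit variance again. For
    # unit-variance independent inputs, output dim i has variance
    # sum_j A[i,j]^2, inflated by (c^2 - 1) on the trajectory columns.
    var = (A ** 2).sum(axis=1) + (c ** 2 - 1) * (A[:, traj_dims] ** 2).sum(axis=1)
    return projected / np.sqrt(var)
```

Because the projection mixes all dimensions, the trajectory is spread across the whole vector with weight c, so the model can no longer ignore it as a handful of coordinates.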

Dense signal propagation

Our goal is to convert the sparse guidance signals defined on a few keyframes into denser signals during denoising. The spatial guidance can only be defined on the clean motion x0, but we need gradients w.r.t. the noisy motion xt to guide the motion at denoising step t. We observe that:

  1. We need a function that can propagate gradients defined on some frames of x0 to their surrounding frames, with respect to xt (not x0!).
  2. By definition, any denoising function (f: xnoisy → xclean) that produces a clean motion from a noisy motion has this dense gradient propagation property, because it needs to look at the context to refine any given frame (let xnoisy = xt and xclean = x0).
  3. This coincides with the diffusion model's predictor x0,θ, which predicts the clean motion x0 from the noisy motion xt.
As such, we can use the diffusion model itself to efficiently compute the gradients. In practice, this translates to using autodiff to compute the guidance gradients w.r.t. the input of the diffusion model.
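In PyTorch terms, the idea can be sketched as below. The `model(x_t, t)` signature that directly predicts x0, and the keyframe loss, are illustrative assumptions rather than the paper's exact code.

```python
import torch

def dense_guidance_grad(model, x_t, t, keyframe_idx, targets):
    """Backpropagate a sparse keyframe loss through the denoiser.

    model(x_t, t) is assumed to predict the clean motion x0 from the noisy
    motion x_t (shape: batch x frames x features). keyframe_idx / targets
    define the sparse spatial objective on a few frames of x0.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = model(x_t, t)                       # predicted clean motion
    # sparse loss: only a few keyframes carry a target location
    loss = ((x0_pred[:, keyframe_idx] - targets) ** 2).sum()
    # autodiff spreads the gradient over all frames of x_t, because every
    # predicted frame of x0 depends on the whole noisy sequence
    grad, = torch.autograd.grad(loss, x_t)
    return grad
```

Even though the loss touches only a few frames of the predicted x0, the returned gradient is dense across all frames of x_t, which is exactly the dense propagation property described above.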

Guidance problem.

Generated motion trajectories, conditioned on target locations at given keyframes. Without dense signal propagation, the model ignores the target conditions.


@inproceedings{karunratanakul2023gmd,
  title={Guided Motion Diffusion for Controllable Human Motion Synthesis},
  author={Karunratanakul, Korrawe and Preechakul, Konpat and Suwajanakorn, Supasorn and Tang, Siyu},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}