EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

1Tsinghua University    2Alibaba Group   
3Nanyang Technological University    4Xi'an Jiaotong University

*Project Lead     Corresponding Author

Overview

Why do state-of-the-art video generation models often fail on complex human actions? While recent models excel at generating visually stunning scenes, they frequently struggle to synthesize plausible and coherent human movements. The reason lies in their training objective: by focusing solely on pixel-level fidelity, these models learn to mimic appearances but fail to grasp the underlying kinematic principles of human articulation. This leads to artifacts like floating feet and unnatural limb movements.

To overcome this, we introduce EchoMotion, a new framework that fundamentally changes the learning paradigm. Instead of treating video as just a sequence of pixels, we propose to jointly model appearance and the explicit human motion that drives it. Our core idea is that by providing the model with a clear understanding of kinematic laws, we can significantly improve the coherence and realism of generated human-centric videos. EchoMotion is designed to learn the joint distribution of what we see (appearance) and how it moves (motion).

Method

At the heart of EchoMotion is a dual-branch Diffusion Transformer (DiT) architecture that processes visual and motion information in parallel. This design allows the model to learn rich cross-modal correlations.
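To make the dual-branch idea concrete, below is a minimal, illustrative sketch of a block that fuses the two token streams: each modality keeps its own normalization and feed-forward path, while self-attention runs over the concatenated sequence so video and motion tokens can attend to one another. All names, dimensions, and layer choices are placeholders of ours, not the released implementation; the real architecture also includes text conditioning and timestep modulation, which are omitted here.

```python
# Minimal, illustrative dual-stream block. Names and sizes are placeholders.
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm1_v, self.norm1_m = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm2_v, self.norm2_m = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Attention runs over the concatenated sequence -> cross-modal fusion.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_m = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, motion_tokens: torch.Tensor):
        # video_tokens: (B, N_v, dim); motion_tokens: (B, N_m, dim)
        x = torch.cat([self.norm1_v(video_tokens), self.norm1_m(motion_tokens)], dim=1)
        fused, _ = self.attn(x, x, x)                 # joint attention across both modalities
        n_v = video_tokens.shape[1]
        video_tokens = video_tokens + fused[:, :n_v]  # split back and apply residuals
        motion_tokens = motion_tokens + fused[:, n_v:]
        video_tokens = video_tokens + self.ffn_v(self.norm2_v(video_tokens))
        motion_tokens = motion_tokens + self.ffn_m(self.norm2_m(motion_tokens))
        return video_tokens, motion_tokens
```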

Our key technical contributions include:

EchoMotion Framework
  • Dual-Stream Architecture with Motion Latents: We first encode raw SMPL motion data into a compact latent representation. These motion tokens are then concatenated with visual tokens from the video frames. Our dual-stream DiT processes this combined sequence, enabling deep fusion of appearance and kinematic information within its attention layers.
  • MVS-RoPE (Motion-Video Synchronized Positional Encoding): How can the model know that a specific motion token corresponds to a specific video frame? We introduce MVS-RoPE, a unified 3D positional encoding scheme that provides a shared coordinate system for video and motion tokens, creating a strong inductive bias toward temporal alignment between the two modalities. This keeps the generated motion synchronized with the visual output (see the sketch after this list).
MVS-RoPE
  • Motion-Video Two-Stage Training Strategy: We design a flexible two-stage training strategy. In the first stage, the model learns to generate motion from text. In the second stage, it learns to generate video conditioned on both text and the motion prior learned in the first stage. This strategy not only enables joint generation but also unlocks versatile capabilities like motion-to-video and video-to-motion translation, all within a single unified model (a rough loss-masking sketch also follows below).
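The sketch below illustrates the idea behind a motion-video synchronized rotary encoding: video tokens carry (time, height, width) positions, while motion tokens reuse the same temporal axis with a fixed spatial coordinate, so tokens describing the same instant share the same temporal rotation. The axis split, frequencies, and the choice of spatial anchor are assumptions for this example, not the paper's exact specification.

```python
# Illustrative shared (time, height, width) coordinate grid for video and
# motion tokens; axis split and frequencies are assumptions for this sketch.
import torch


def video_positions(T: int, H: int, W: int) -> torch.Tensor:
    """(T*H*W, 3) grid of (t, h, w) indices for video latents."""
    t, h, w = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)


def motion_positions(T: int, H: int, W: int) -> torch.Tensor:
    """(T, 3) positions for motion tokens: the same temporal index as the
    video frame they describe, with a fixed spatial coordinate."""
    t = torch.arange(T)
    return torch.stack([t, torch.full_like(t, H // 2), torch.full_like(t, W // 2)], dim=-1)


def rope_phases(pos: torch.Tensor, dims=(16, 8, 8), base: float = 10000.0) -> torch.Tensor:
    """Rotation angles per token; each axis owns its own block of frequencies."""
    blocks = []
    for axis, d in enumerate(dims):
        freqs = base ** (-torch.arange(d, dtype=torch.float32) / d)
        blocks.append(pos[:, axis : axis + 1].float() * freqs)  # (N, d)
    return torch.cat(blocks, dim=-1)  # applied downstream as cos/sin rotations


# A video token and a motion token at the same frame share the temporal block
# of their rotation angles, which is the inductive bias for frame alignment.
pv = rope_phases(video_positions(T=8, H=4, W=4)).reshape(8, 16, -1)
pm = rope_phases(motion_positions(T=8, H=4, W=4))
assert torch.allclose(pv[:, 0, :16], pm[:, :16])
```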
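As a rough illustration of how such a two-stage schedule could be wired up, the function below switches which modality contributes to the regression loss per stage. The stage names, the joint-stage weighting, and the decision to keep a small motion term in stage two are assumptions of ours, not the paper's exact recipe.

```python
# Sketch of stage-dependent supervision for a two-stage schedule.
import torch.nn.functional as F


def stage_loss(pred_video, pred_motion, target_video, target_motion, stage: str):
    loss_motion = F.mse_loss(pred_motion, target_motion)
    loss_video = F.mse_loss(pred_video, target_video)
    if stage == "stage1_text_to_motion":      # only the motion branch is trained
        return loss_motion
    if stage == "stage2_joint":               # video supervised, conditioned on text + motion
        return loss_video + 0.1 * loss_motion  # weighting is an arbitrary example
    raise ValueError(f"unknown stage: {stage}")
```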

The HuMoVe Dataset

Training a model like EchoMotion requires a large-scale, high-quality dataset of paired video and motion data. To this end, we constructed HuMoVe, the first dataset of its kind, containing approximately 80,000 video-motion pairs.

HuMoVe Dataset Samples

Highlights:

  • Wide Category Coverage: HuMoVe spans a diverse range of human activities, from daily actions and sports to complex dance performances, ensuring our model learns a robust and generalizable representation of human motion.
  • High-Quality Annotations: Each video is accompanied by both a detailed textual description and a precise SMPL motion sequence, extracted using state-of-the-art motion capture techniques (an illustrative sample layout is sketched after this list).
  • High-Fidelity Videos: We meticulously curated high-resolution, clean video clips, free from major occlusions or distracting backgrounds, providing an ideal training ground for generation models.
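For concreteness, one plausible way to represent a single HuMoVe sample as described above is shown below. The field names and array shapes (standard SMPL per-frame axis-angle pose, shape coefficients, and root translation) are our assumptions for illustration, not the dataset's released schema.

```python
# Hypothetical record layout for one video-motion pair; field names and
# shapes are assumptions, not the dataset's actual format.
from dataclasses import dataclass
import numpy as np


@dataclass
class HuMoVeSample:
    video_path: str        # path to the curated, high-resolution clip
    caption: str           # detailed textual description of the action
    smpl_pose: np.ndarray  # (T, 72) axis-angle body pose per frame (24 joints x 3)
    smpl_shape: np.ndarray # (10,) body shape coefficients
    transl: np.ndarray     # (T, 3) global root translation per frame
    fps: float             # frame rate used to align motion with video frames
```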

Results

Text to Joint Video-and-Motion Generation

Quantitative Comparison

We first evaluate EchoMotion on generating a video and its corresponding motion sequence simultaneously from a text prompt. Compared to baseline models that only generate video, EchoMotion produces significantly more coherent and physically plausible human movements.

Quantitative Comparison Table

Visual Comparison

"The video shows a skateboarder quickly sliding down a stair railing, maintaining balance throughout, with bold and smooth movements."

"A gymnast is stretching on the mat before training, lifting her legs over her head and showing off her incredible flexibility."

Wan5B

EchoMotion (Wan5B)

Motion-to-Video Generation

Thanks to our versatile framework, EchoMotion can also generate a video conditioned on a given motion sequence (e.g., from a motion capture file). This allows for precise control over the generated character's actions.
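One common way a joint model of this kind can be conditioned on a single modality is to clamp that modality's latents to their clean values at every denoising step while only the other stream is denoised. The sketch below shows that pattern; `conditional_sample`, `model.denoise_step`, and the latent shapes are hypothetical placeholders, not EchoMotion's actual API.

```python
# Sketch of modality-conditioned sampling: the conditioning stream is clamped
# to its clean latents while the other stream is denoised. `model.denoise_step`
# is a hypothetical interface, not EchoMotion's released API.
import torch


@torch.no_grad()
def conditional_sample(model, text_emb, video_shape, motion_shape,
                       cond_video=None, cond_motion=None, steps=50):
    video = torch.randn(video_shape)       # start both streams from noise ...
    motion = torch.randn(motion_shape)
    for i in reversed(range(steps)):
        if cond_video is not None:         # ... then clamp whichever one is given
            video = cond_video             # video-to-motion: video stream stays clean
        if cond_motion is not None:
            motion = cond_motion           # motion-to-video: motion stream stays clean
        t = torch.full((video.shape[0],), i / steps)
        video, motion = model.denoise_step(video, motion, text_emb, t)
    return video, motion


# Motion-to-video: drive the character with motion latents from a mocap clip.
# `model`, `text_emb`, and `mocap_latents` are assumed to exist already.
# video, _ = conditional_sample(model, text_emb, video_shape=(1, 16, 4, 32, 32),
#                               motion_shape=mocap_latents.shape,
#                               cond_motion=mocap_latents)
```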

Video-to-Motion Prediction

As a reverse task, EchoMotion can predict the underlying 3D motion sequence from an input video. This showcases the model's deep understanding of the relationship between appearance and kinematics.
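Under the same hypothetical sampler sketched in the previous section, the reverse task simply swaps which stream is clamped; `model`, `text_emb`, and `video_latents` are assumed to be available from that sketch, and the motion latent shape is an arbitrary example.

```python
# Video-to-motion with the placeholder sampler from the previous sketch:
# clamp the video stream and denoise only the motion stream.
_, predicted_motion = conditional_sample(
    model, text_emb,
    video_shape=video_latents.shape,
    motion_shape=(1, 120, 512),   # example (batch, frames, latent dim)
    cond_video=video_latents,
)
```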

BibTeX

@article{yang2025echomotion,
    title={EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer},
    author={Yang, Yuxiao and Sheng, Hualian and Cai, Sijia and Lin, Jing and Wang, Jiahao and Deng, Bing and Lu, Junzhe and Wang, Haoqian and Ye, Jieping},
    journal={arXiv preprint arXiv:2512.18814},
    year={2025}
}