UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

One Unified Framework · Three Meta-Operations · Diverse Motion Tasks

Xiaoyan Cong1*, Zekun Li1*, Zhiyang Dou2, Hongyu Li1, Omid Taheri3, Chuan Guo4, Abhay Mittal4, Sizhe An4, Taku Komura5, Wojciech Matusik2, Michael J. Black3, Srinath Sridhar1

1Brown University   2Massachusetts Institute of Technology   3Max Planck Institute for Intelligent Systems
4Meta Reality Labs   5The University of Hong Kong

UMO Teaser: diverse motion tasks as compositions of atomic operations
Radar chart comparing UMO across tasks and metrics

Text-to-Motion Generation

Generating realistic human motions from natural language descriptions.

Temporal Inpainting

Keyframe infilling, prediction, backcasting, and in-betweening.

Keyframe (given) · Generated (ours)
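The four inpainting settings above differ mainly in which frames are given and which are generated. The sketch below illustrates that idea as a per-frame boolean mask. It is not the UMO implementation: the function name `keyframe_mask`, the `context` and `stride` parameters, and the reading of each setting (prediction keeps the first frames, backcasting the last, in-betweening both ends, keyframe infilling a sparse subset) are assumptions based on common usage of these terms.

```python
from typing import List


def keyframe_mask(num_frames: int, mode: str, context: int = 4, stride: int = 10) -> List[bool]:
    """Return a per-frame mask: True = frame is given, False = frame is generated.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 < context <= num_frames:
        raise ValueError("context must be in 1..num_frames")
    if stride <= 0:
        raise ValueError("stride must be positive")

    frames = range(num_frames)
    if mode == "prediction":
        # The first `context` frames are given; the future is generated.
        return [t < context for t in frames]
    if mode == "backcasting":
        # The last `context` frames are given; the past is generated.
        return [t >= num_frames - context for t in frames]
    if mode == "in_betweening":
        # Both ends are given; the middle is generated.
        return [t < context or t >= num_frames - context for t in frames]
    if mode == "keyframe_infilling":
        # Every `stride`-th frame is given; everything else is generated.
        return [t % stride == 0 for t in frames]
    raise ValueError(f"unknown mode: {mode}")


# Example: a 6-frame clip where the first 2 frames are given.
print(keyframe_mask(6, "prediction", context=2))
```

Expressing all four settings as one mask is a convenient way to see them as variants of a single task, with only the set of given frames changing.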

Instruction-Based Motion Editing

Text-guided motion editing.

Trajectory Following

Follow geometric trajectories while maintaining natural motion.

{"type":"circular_arc", "start":[0.0, 0.0], "end":[2.92, 5.35], "center":[2.29, 2.22], "radius":3.19, "direction":"clockwise"}

{"type":"cubic_bezier","params":{"start":[0.0,0.0],"end":[3.54,4.15],"P0":[0.0,0.0],"P1":[-1.02,3.24],"P2":[4.55,0.92],"P3":[3.54,4.15]}}

{"type":"cubic_bezier","params":{"start":[0.0,0.0],"end":[3.98,2.03],"P0":[0.0,0.0],"P1":[0.47,2.34],"P2":[3.52,-0.31],"P3":[3.98,2.03]}}

Obstacle Avoidance

Navigate from point A to point B while avoiding obstacles.

A person walks from (0.00, 0.00) to (3.67, 5.36). Avoiding 2 obstacles at (0.71, 1.09, r=0.25), (2.86, 4.31, r=0.35), where r is the safety radius in meters.

A person walks from (-0.00, 0.00) to (3.28, 6.41). Avoiding 3 obstacles at (0.72, 3.65, r=0.43), (1.46, 3.98, r=0.42), (0.32, 2.93, r=0.24), where r is the safety radius in meters.
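One way to read the prompts above is as a geometric constraint: the root path on the ground plane should stay at least `r` meters from each obstacle center. The sketch below checks that constraint for a polyline path. The helper names and the `(x, y, r)` tuple layout are assumptions for illustration, not the evaluation protocol used by the authors.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (all 2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def min_clearance(path, obstacles):
    """Smallest (distance to obstacle center - safety radius) along the path.

    A negative value means the path enters at least one safety disc.
    """
    if len(path) < 2:
        raise ValueError("path needs at least two points")
    return min(
        point_segment_distance((ox, oy), a, b) - r
        for (ox, oy, r) in obstacles
        for a, b in zip(path, path[1:])
    )


# First prompt above: the straight line from start to goal is not a valid
# solution, because it passes through both safety discs.
obstacles = [(0.71, 1.09, 0.25), (2.86, 4.31, 0.35)]
straight = [(0.00, 0.00), (3.67, 5.36)]
print(min_clearance(straight, obstacles) < 0)  # True
```

In this example the obstacles sit almost directly on the straight start-to-goal line, so a generated motion has to detour to satisfy the prompt.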

Dual-Identity Reaction Generation

Two-person interaction generation, a setting entirely absent from single-person pretraining.

Source · Generated

Architecture Ablation

Comparing four conditioning architectures for in-context feature integration.