Scaling World Model for Hierarchical Manipulation Policies

Long Qian*,1,2, Yueze Wang*,2, Jiaxi Song*,2,3, Junbo Zhang2,3, Peiyan Li2,5, Wenxuan Wang2,5, Yuqi Wang2,5, Haoyang Li2, Shaoxuan Xie2, Guocai Yao2, Hanbo Zhang4, Xinlong Wang2, Zhongyuan Wang2, Xuguang Lan†,1, Huaping Liu†,3, Xinghang Li†,‡,2
1Xi'an Jiaotong University 2Beijing Academy of Artificial Intelligence 3Tsinghua University 4National University of Singapore 5Institute of Automation, Chinese Academy of Sciences *denotes equal contribution

Overview

[Figure: Overview of VISTA]

We introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization ability of a large-scale pre-trained world model for robust VIsual Subgoal TAsk decomposition. Our hierarchical framework pairs a world model as the high-level planner with a VLA as the low-level executor. The world model first decomposes a manipulation task into a sequence of subtasks, each paired with a goal image, and the low-level policy then follows this textual and visual guidance to generate action sequences. Compared to raw textual goal specifications, the synthesized goal images provide visually and physically grounded details for the low-level policy, making it feasible to generalize across unseen objects and novel scenes.
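To make the division of labor concrete, here is a minimal sketch of the high-level/low-level interface described above. The class and method names (`Subgoal`, `WorldModelPlanner`, `GoalVLA`, `plan`, `act`) are illustrative placeholders rather than the released API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Subgoal:
    """One step of the high-level plan: a textual subtask plus a synthesized goal image."""
    text: str                  # e.g. "pick up the banana"
    goal_image: np.ndarray     # H x W x 3 goal image generated by the world model


class WorldModelPlanner:
    """High-level planner (hypothetical interface): decomposes an instruction
    into a sequence of subgoals, each with a generated goal image."""

    def plan(self, observation: np.ndarray, instruction: str) -> List[Subgoal]:
        # A large-scale pre-trained embodied world model would autoregressively
        # generate (subtask text, goal image) pairs here.
        raise NotImplementedError


class GoalVLA:
    """Low-level executor (hypothetical interface): maps the current observation
    plus the textual and visual subgoal to an action sequence."""

    def act(self, observation: np.ndarray, subgoal: Subgoal) -> np.ndarray:
        # A VLA policy conditioned on both the subtask text and the goal image.
        raise NotImplementedError
```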

Pipeline

[Figure: The VISTA pipeline]

The pipeline of VISTA. (1) An embodied world model autoregressively generates subtasks and their corresponding goal images from the initial observation and the instruction. (2) GoalVLA then predicts actions following the provided subgoals and automatically switches to the next subtask once the current one is completed.
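The closed-loop execution described in the caption can be sketched as follows. The `subgoal_reached` heuristic, the generic `env`/`planner`/`policy` interfaces, and the step budget are assumptions for illustration; the actual subtask-completion criterion in VISTA may differ.

```python
import numpy as np


def subgoal_reached(observation: np.ndarray, goal_image: np.ndarray,
                    threshold: float = 0.9) -> bool:
    """Toy completion check: cosine similarity between the flattened current
    observation and the goal image. Stands in for whatever subtask-completion
    detector the real system uses."""
    a = observation.astype(np.float32).ravel()
    b = goal_image.astype(np.float32).ravel()
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sim > threshold


def run_episode(env, planner, policy, instruction: str, max_steps: int = 500) -> bool:
    """(1) Plan subtasks and goal images once from the initial observation,
    (2) execute them in order, switching to the next subgoal once the
    current one appears completed."""
    obs = env.reset()
    subgoals = planner.plan(obs, instruction)     # list of (subtask text, goal image) pairs
    idx = 0
    for _ in range(max_steps):
        if idx >= len(subgoals):
            return True                           # all subtasks completed
        action = policy.act(obs, subgoals[idx])   # conditioned on text + goal image
        obs = env.step(action)
        if subgoal_reached(obs, subgoals[idx].goal_image):
            idx += 1                              # switch to the next subtask
    return idx >= len(subgoals)
```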

Embodied World Model Generation Capability

Our embodied world model generates diverse cross-embodiment goal images and produces logically consistent subtasks with physically plausible goal images from instructions, even in unconstrained novel scenes. We further explore its emergent capabilities, including multi-step planning for compositional tasks and goal generation for spatial and semantic understanding tasks.

Cross-Embodiment

[Figure: real-world cross-embodiment goal image generation]

Novel Scenarios

[Figure: goal image generation in novel scenarios]

Emergent Capabilities

[Figure: emergent capabilities of the world model]

Real-World Experiment

[Figure: real-world experiment illustration]

We evaluate the methods in both in-domain and out-of-distribution novel scenarios. GoalVLA is trained with only two hours of real-world data collected on 5 objects, yet it successfully generalizes to 21 unseen objects and novel scenes by leveraging the generated goal images.

In-domain Scenarios

Our approach demonstrates greater robustness than the baseline when encountering unseen distractors and unseen targets.

[Figure: in-domain scenario results]

Novel Scenarios

In challenging novel scenarios, the embodied world model accurately generates instructive goal images, and GoalVLA predicts correct actions under this guidance. Compared to the baseline, which relies solely on language instructions, our method achieves significant performance improvements.

[Figure: performance in novel scenarios]

[Figure: WidowX zero-shot results]

[Figure: execution comparison on WidowX zero-shot tasks]

Emergent Capabilities

Compositional Tasks

[Figure: compositional task case comparison]

Spatial Understanding Tasks

[Figure: spatial understanding task case comparison]

Semantic Understanding Tasks

[Figure: semantic understanding task case comparison]