Scaling World Model for Hierarchical Manipulation Policies

Long Qian*,1,2, Yueze Wang*,2, Jiaxi Song*,2,3, Junbo Zhang2,3, Peiyan Li2,5, Wenxuan Wang2,5, Yuqi Wang2,5, Haoyang Li2, Shaoxuan Xie2, Guocai Yao2, Hanbo Zhang4, Xinlong Wang2, Zhongyuan Wang2, Xuguang Lan†,1, Huaping Liu†,3, Xinghang Li†,‡,2
1Xi'an Jiaotong University 2Beijing Academy of Artificial Intelligence 3Tsinghua University 4National University of Singapore 5Institute of Automation, Chinese Academy of Sciences *denotes equal contribution

Overview

[Figure: Overview of VISTA]

We introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization ability of a large-scale pre-trained world model for robust VIsual Subgoal TAsk decomposition. Our hierarchical framework pairs a world model as the high-level planner with a VLA as the low-level executor. The world model first decomposes a manipulation task into a sequence of subtasks, each paired with a goal image, and the low-level policy then follows this textual and visual guidance to generate action sequences. Compared to raw textual goal specifications, the synthesized goal images provide visually and physically grounded details for the low-level policy, making it feasible to generalize across unseen objects and novel scenes.
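To make the division of labor concrete, here is a minimal sketch of the high-level/low-level interface described above. The class and method names (`Subgoal`, `WorldModelPlanner`, `GoalVLA`, `plan`, `act`) are illustrative placeholders rather than the released API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Subgoal:
    """One step of the high-level plan: a textual subtask plus a synthesized goal image."""
    text: str                  # e.g. "pick up the banana"
    goal_image: np.ndarray     # H x W x 3 goal image generated by the world model


class WorldModelPlanner:
    """High-level planner (hypothetical interface): decomposes an instruction
    into a sequence of subgoals, each with a generated goal image."""

    def plan(self, observation: np.ndarray, instruction: str) -> List[Subgoal]:
        # A large-scale pre-trained embodied world model would autoregressively
        # generate (subtask text, goal image) pairs here.
        raise NotImplementedError


class GoalVLA:
    """Low-level executor (hypothetical interface): maps the current observation
    plus the textual and visual subgoal to an action sequence."""

    def act(self, observation: np.ndarray, subgoal: Subgoal) -> np.ndarray:
        # A VLA policy conditioned on both the subtask text and the goal image.
        raise NotImplementedError
```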

Pipeline

[Figure: The VISTA pipeline]

The pipeline of VISTA. (1) An embodied world model autoregressively generates subtasks and their corresponding goal images from the initial observation and the instruction. (2) GoalVLA then predicts actions following the provided subgoals and automatically switches to the next subtask once the current one is completed.
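The closed-loop execution described in the caption can be sketched as follows. The `subgoal_reached` heuristic, the generic `env`/`planner`/`policy` interfaces, and the step budget are assumptions for illustration; the actual subtask-completion criterion in VISTA may differ.

```python
import numpy as np


def subgoal_reached(observation: np.ndarray, goal_image: np.ndarray,
                    threshold: float = 0.9) -> bool:
    """Toy completion check: cosine similarity between the flattened current
    observation and the goal image. Stands in for whatever subtask-completion
    detector the real system uses."""
    a = observation.astype(np.float32).ravel()
    b = goal_image.astype(np.float32).ravel()
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sim > threshold


def run_episode(env, planner, policy, instruction: str, max_steps: int = 500) -> bool:
    """(1) Plan subtasks and goal images once from the initial observation,
    (2) execute them in order, switching to the next subgoal once the
    current one appears completed."""
    obs = env.reset()
    subgoals = planner.plan(obs, instruction)     # list of (subtask text, goal image) pairs
    idx = 0
    for _ in range(max_steps):
        if idx >= len(subgoals):
            return True                           # all subtasks completed
        action = policy.act(obs, subgoals[idx])   # conditioned on text + goal image
        obs = env.step(action)
        if subgoal_reached(obs, subgoals[idx].goal_image):
            idx += 1                              # switch to the next subtask
    return idx >= len(subgoals)
```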

Embodied World Model Generation Capability

Our embodied world model generates diverse cross-embodiment goal images and produces logically consistent subtasks with physically plausible goal images from instructions, even in unconstrained novel scenes. We further explore its emergent capabilities, including multi-step planning for compositional tasks and goal generation for spatial and semantic understanding tasks.

Cross-Embodiment

[Figure: real-world cross-embodiment goal image generation]

Novel Scenarios

[Figure: goal image generation in novel scenarios]

Emergent Capabilities

[Figure: emergent capabilities of the world model]

Real-World Experiment

[Figure: real-world experiment illustration]

We evaluate the methods in both in-domain and out-of-distribution novel scenarios. GoalVLA is trained with only two hours of real-world data collected on 5 objects, yet it successfully generalizes to 21 unseen objects and novel scenes by leveraging the generated goal images.

In-domain Scenarios

Our approach demonstrates greater robustness than the baseline when encountering unseen distractors and unseen targets.

[Figure: in-domain scenario results]

Novel Scenarios

In challenging novel scenarios, the embodied world model accurately generates instructive goal images, and GoalVLA predicts correct actions under this guidance. Compared to the baseline, which relies solely on language instructions, our method achieves significant performance improvements.

[Figure: performance in novel scenarios]

[Figure: WidowX zero-shot results]

[Figure: execution comparison on WidowX zero-shot tasks]

Emergent Capabilities

Compositional Tasks

[Figure: compositional task case comparison]

Spatial Understanding Tasks

[Figure: spatial understanding task case comparison]

Semantic Understanding Tasks

[Figure: semantic understanding task case comparison]