We introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization ability of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. Our hierarchical framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first decomposes manipulation tasks into subtask sequences with corresponding goal images, and the low-level policy follows this textual and visual guidance to generate action sequences. Compared to raw textual goal specifications, these synthesized goal images provide visually and physically grounded details for the low-level policy, making it feasible to generalize across unseen objects and novel scenarios.
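For concreteness, the sketch below outlines the two-level interface implied by this design: a high-level planner that turns an instruction and an observation into a sequence of (subtask text, goal image) pairs, and a low-level goal-conditioned policy that consumes both. All names here (Subgoal, WorldModelPlanner, GoalConditionedVLA) are hypothetical placeholders for illustration, not the released API.

```python
# Hypothetical sketch of the hierarchical planner/executor interface.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Subgoal:
    text: str               # textual subtask, e.g. "pick up the red cup"
    goal_image: np.ndarray  # synthesized RGB goal image, shape (H, W, 3)


class WorldModelPlanner:
    """High-level planner: instruction + observation -> subgoal sequence."""

    def plan(self, instruction: str, observation: np.ndarray) -> List[Subgoal]:
        # Backed by the pre-trained embodied world model (placeholder here).
        raise NotImplementedError


class GoalConditionedVLA:
    """Low-level executor: observation + subgoal -> action."""

    def act(self, observation: np.ndarray, subgoal: Subgoal) -> np.ndarray:
        # Backed by the GoalVLA policy (placeholder here).
        raise NotImplementedError

    def is_done(self, observation: np.ndarray, subgoal: Subgoal) -> bool:
        # Subtask-completion check used to switch to the next subgoal.
        raise NotImplementedError
```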
The pipeline of VISTA. (1) We first autoregressively generate subtasks and their corresponding goal images using an embodied world model, conditioned on the initial observation and the instruction. (2) GoalVLA then predicts actions following the provided subgoals and automatically switches to the next subtask upon completion of the current one.
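The caption above describes a plan-then-execute loop with automatic subtask switching. A minimal sketch of that loop follows; the planner, policy, and environment interfaces (plan, act, is_done, reset, step) are assumptions made for illustration rather than the paper's actual implementation.

```python
# Hypothetical sketch of the two-stage VISTA execution loop.
def run_episode(planner, policy, env, instruction, max_steps_per_subtask=200):
    """(1) Plan subtasks with goal images once, (2) roll out GoalVLA per subtask."""
    obs = env.reset()
    # Autoregressive subtask / goal-image generation from the initial observation.
    subgoals = planner.plan(instruction, obs)

    for subgoal in subgoals:
        for _ in range(max_steps_per_subtask):
            action = policy.act(obs, subgoal)  # conditioned on text + goal image
            obs = env.step(action)
            if policy.is_done(obs, subgoal):   # switch to the next subtask on completion
                break
    return obs
```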
Our embodied world model can generate diverse cross-embodiment goal images, and it produces logically consistent subtasks and physically plausible goal images from instructions even in unconstrained novel scenes. Furthermore, we explore its emergent capabilities, including multi-step planning for composite tasks and goal-image generation for spatial and semantic understanding tasks.
We evaluate the methods on both in-domain and out-of-distribution novel scenarios. Our GoalVLA is trained with only two hours of real-world data collected on 5 objects, yet it successfully generalizes to 21 unseen objects and novel scenes by leveraging the generated goal images.
Our approach demonstrates greater robustness than the baseline when encountering unseen distractors and unseen targets.
In challenging novel scenarios, the embodied world model accurately generates instructive goal images, and GoalVLA predicts correct actions based on this guidance. Compared to a baseline that relies solely on language-instruction guidance, our method achieves significant performance improvements.
Figures: performance comparison; execution comparison.