Masked Visual Pre-training for Motor Control

*, †: Equal contribution

UC Berkeley

We show that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same visual representations are used for all motor control tasks. To the best of our knowledge, this is the first self-supervised model to exploit real-world images at scale for motor control. To accelerate progress in learning from pixels, we contribute a benchmark suite of hand-designed tasks varying in movements, scenes, and robots. Without relying on labels, state estimation, or expert demonstrations, we consistently outperform supervised encoders by up to 80% in absolute success rate, sometimes even matching the performance of the oracle state model. We also find that in-the-wild images, e.g., from YouTube or egocentric videos, lead to better visual representations for various manipulation tasks than ImageNet images.


Masked visual pre-training for motor control. Left: We first pre-train visual representations using self-supervision through masked image modeling from real-world images. Right: We then freeze the image encoder and train task-specific controllers on top with reinforcement learning (RL). The same visual representations are used for all motor control tasks.
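
The masking step of the first stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_patch_mask`, the 14 x 14 patch grid, and the 75% mask ratio are all assumptions chosen for the example, since the text above does not state them. The sketch only shows how a random subset of image patches is hidden; in masked image modeling, the encoder sees the visible patches and the model is trained to reconstruct the hidden ones.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists uniformly at random.

    num_patches and mask_ratio are illustrative parameters; the values used
    below are assumptions, not settings taken from the text.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Hypothetical setup: a 14 x 14 grid of patches with 75% of them hidden.
visible, masked = random_patch_mask(num_patches=14 * 14, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Because the encoder is frozen after this stage, the masking logic would only be needed during pre-training; the controllers trained with RL would consume features of full, unmasked images.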

Outperforms Supervised Pre-training

We plot the success rate as a function of environment steps on the 8 PixMC tasks. Each task uses either the Franka arm with a parallel gripper or the Kuka arm with a multi-finger hand. The MVP approach significantly outperforms the supervised baseline on 7 of the 8 tasks and closely matches the oracle state model (considered the upper bound of RL) on 5 tasks at convergence. This result shows that self-supervised pre-training markedly improves representation quality for motor control tasks.
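
To make the metric concrete, the sketch below computes a success rate from per-episode outcomes and the absolute gap between two methods in percentage points. The episode outcomes are made-up placeholders for illustration, not results from the experiments, and the helper names `success_rate` and `absolute_gap_points` are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of evaluation episodes that ended in success."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def absolute_gap_points(rate_a, rate_b):
    """Absolute difference between two success rates, in percentage points."""
    return 100.0 * (rate_a - rate_b)


# Placeholder outcomes (True = task solved); NOT data from the experiments.
method_a = [True] * 8 + [False] * 2
method_b = [True] * 3 + [False] * 7

gap = absolute_gap_points(success_rate(method_a), success_rate(method_b))
print(f"{gap:.0f} percentage points")
```

An absolute gap differs from a relative improvement: in this placeholder example, going from 30% to 80% is a gain of 50 points, whereas the relative improvement would be roughly 167%.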

Disentangles Shape and Color

Generalizes to Different Objects


  title={Masked Visual Pre-training for Motor Control},
  author={Tete Xiao and Ilija Radosavovic and Trevor Darrell and Jitendra Malik},
  journal={arXiv preprint arXiv:2203.06173},


We thank William Peebles, Matthew Tancik, Anastasios Angelopoulos, Aravind Srinivas, and Agrim Gupta for helpful discussions. We thank the NVIDIA Isaac Gym team for the simulator. This work was supported in part by the DOD, including DARPA's MCS, XAI, LwLL, and/or SemaFor programs; the ONR MURI program (N00014-14-1-0671); and BAIR's industrial alliance programs.