Given only a single picture, people are capable of inferring a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people’s ability to understand, navigate, and interact with their surroundings. In this talk, I will demonstrate how we can equip neural networks with inductive biases that enable them to learn 3D geometry, appearance, and even semantic information, self-supervised only from posed images. I will show how this approach unlocks the learning of priors, enabling 3D reconstruction from only a single posed 2D image. I will then talk about a recent application of self-supervised scene representation learning to robotic manipulation, where it enables us to learn to manipulate classes of objects in unseen poses from only a handful of human demonstrations, as well as the application of neural rendering to learning latent spaces amenable to control. I will then discuss recent work on learning the neural rendering operator itself to make rendering and training fast, and how this speed-up enables us to learn object-centric neural scene representations that decompose 3D scenes into objects given only images. Finally, I will discuss how neural scene representations may offer a new angle on tackling challenges in robotics.