This was a hybrid event with in-person attendance in Levine 307 and virtual attendance…
What kind of representation do robots need in order to be as generally capable as humans in handling unseen scenarios? Recent vision and vision-language foundation models have become quite good at telling what is in a scene, but they do not capture the geometry needed for handling physical contact. State-of-the-art methods in inverse graphics capture detailed 3D geometry, but they lack semantics. In this talk, I will present a way to combine accurate 3D geometry with rich semantics into a single representation called distilled feature fields, along with ways to use this representation for perception during few-shot manipulation with a robotic arm. Using features sourced from the vision-language model CLIP, our method allows the user to designate novel objects for manipulation via free-text natural language, and it generalizes to unseen expressions and novel categories of objects. I will also present ways to scale feature fields up for building maps, and their dual use in building realistic physics simulators for reinforcement learning. Finally, I will present our recent effort to build a unified representation for semantics, geometry, and physics called Feature Splatting.
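As a rough illustration of the open-vocabulary querying described above, the sketch below scores the points of a hypothetical distilled feature field against a CLIP text embedding by cosine similarity and returns the centroid of the best-matching region. The per-point features, shapes, and the 99th-percentile threshold are placeholder assumptions for illustration, not the talk's actual implementation; only the text-encoding calls use the real CLIP API.

```python
# Conceptual sketch: query a distilled feature field with free-text language.
# The per-point features below are random placeholders standing in for CLIP
# features distilled into a 3D field.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical field: N 3D points, each with a 512-dim distilled CLIP feature.
num_points = 10_000
points = torch.rand(num_points, 3, device=device)           # xyz positions
point_feats = torch.randn(num_points, 512, device=device)   # placeholder features
point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)

# Encode a free-text query into the same embedding space.
tokens = clip.tokenize(["a blue coffee mug"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens).float()
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Cosine similarity between every point feature and the text embedding.
scores = (point_feats @ text_feat.T).squeeze(-1)   # (N,)
mask = scores > scores.quantile(0.99)              # keep the top 1% of points

# Centroid of the highest-scoring points: a crude 3D answer to "where is it?"
target = points[mask].mean(dim=0)
print("query centroid:", target.tolist())
```

In the actual work, the field's features come from distilling CLIP image features into a 3D representation rather than from random tensors, but a language query amounts to a similarity test of roughly this kind against the distilled features.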