This was a hybrid event with in-person attendance in Levine 307 and virtual attendance…
Today’s machine perception systems rely extensively on human-provided supervision, such as natural language. I will talk about our efforts to build systems that instead learn from two ubiquitous sources of unlabeled sensory data: visual motion and cross-modal associations between the senses. First, I will discuss our work on unified self-supervised motion analysis methods that can address both object tracking and optical flow. I will then discuss how these same techniques can be applied to localizing sound sources in video, and how tactile sensing data can be used to train multimodal visual-tactile models. Finally, I will talk about our recent work on subverting visual perception systems by creating “multi-view” optical illusions: images that change their appearance under a transformation, such as a flip or rotation.
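For readers curious how such transformation-dependent images might be generated, the sketch below is a minimal, hypothetical illustration (not code from the talk). It assumes a diffusion-style generator in which a noise estimate is computed for each view of the image (identity, vertical flip, 180° rotation), mapped back into a common frame by the view’s inverse transform, and averaged before the update. The `predict_noise` function, the prompts, and the simplified update rule are all placeholder assumptions standing in for a real pretrained model and sampler.

```python
# Illustrative sketch only: one denoising-style step for a "multi-view" illusion.
# Assumes a diffusion-like setup where each view gets its own noise estimate,
# and the estimates are reconciled by averaging in a shared image frame.
import numpy as np

def predict_noise(image, prompt):
    """Hypothetical stand-in for a text-conditioned diffusion denoiser."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(image.shape)

# Each view is (name, transform, inverse transform) acting on the image array.
views = [
    ("identity", lambda x: x,              lambda x: x),
    ("flip",     lambda x: x[::-1],        lambda x: x[::-1]),
    ("rot180",   lambda x: np.rot90(x, 2), lambda x: np.rot90(x, -2)),
]
prompts = {  # hypothetical per-view text prompts
    "identity": "an oil painting of a bear",
    "flip":     "an oil painting of a waterfall",
    "rot180":   "an oil painting of a campfire",
}

def multi_view_step(x_t, step_size=0.1):
    """Average per-view noise estimates in the original frame, then update."""
    estimates = []
    for name, fwd, inv in views:
        eps = predict_noise(fwd(x_t), prompts[name])  # denoise the transformed view
        estimates.append(inv(eps))                    # map estimate back to shared frame
    eps_avg = np.mean(estimates, axis=0)
    return x_t - step_size * eps_avg                  # simplified update, not a full sampler

x = np.random.standard_normal((64, 64, 3))
for _ in range(10):
    x = multi_view_step(x)
print("final image stats:", x.mean(), x.std())
```

Because each view’s noise estimate is brought back into the same frame before averaging, the resulting image is pushed toward satisfying all prompts at once, one per transformation, which is what makes the picture read differently when flipped or rotated.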