This seminar was held in person in Wu and Chen Auditorium as well as virtually…
In the domain of image and video analysis, much of the deep learning revolution has focused on narrow, high-level classification tasks defined through carefully curated, retrospective data sets. However, most real-world applications, particularly those involving complex, multi-step manipulation activities, occur “in the wild,” where a combinatorial long tail of unique situations is never seen during training. These systems demand a richer, fine-grained task representation that is informed by the application context and that supports quantitative analysis and compositional synthesis. As a result, the challenges inherent in high-accuracy, fine-grained analysis and in the performance of perception-based activities are manifold, spanning representation, recognition, and task and motion planning.
This talk will summarize our work addressing these challenges. I’ll first describe DASZL, our approach to interpretable, attribute-based activity detection. DASZL operates in both pre-trained and zero-shot settings, and it has been applied to a variety of applications ranging from surveillance to surgery. I will then describe our recent work on “Good Robot”, a method for end-to-end training of a robot manipulation system. Good Robot achieves state-of-the-art performance on complex, multi-step manipulation tasks, and we show that it can be refactored to support both demonstration-driven and language-guided manipulation. I’ll close with a summary of directions related to these technologies that we are currently exploring.