This will be a hybrid event with in-person attendance in Levine 512 and virtual attendance on Zoom. This seminar will NOT be recorded.
Dynamic scene reconstruction from monocular cameras often requires us to estimate depth and 3D motion simultaneously, where knowledge of either one helps constrain the other. I will review two different approaches to this problem, describe their relative merits, and discuss what they tell us (if anything!) about the underlying challenge. The first approach uses active illumination to augment the scene via continuous-wave time-of-flight measurements. I will explain how this additional depth input helps only superficially, introducing new problems of its own, and how we can resolve them using self-supervision and physically-based rendering. This lets us reconstruct objects under fast motion, such as swinging baseball bats (ECCV 2024, ongoing work, and arXiv). The second approach uses supervised learning to directly predict depth and 3D scene flow from only two RGB images. Here, generalization is the key challenge, and factors like motion parameterization and data scaling are critical. Careful empirical work lets us use a single feed-forward neural network to predict depth and motion for casual videos, robotic manipulation videos, and autonomous driving videos (arXiv). Finally, after all that hard work, if there's time I'll show some pretty pictures of shiny objects, because who doesn't like those (SIGGRAPH Asia 2024)?!