This will be a hybrid event with in-person attendance in Levine 307 and virtual attendance on Zoom.
The past few years have witnessed great success in video intelligence, supercharged by multimodal models. In this talk, I will begin by briefly sharing our efforts in building video-language models for understanding and diffusion models for video generation. Yet video understanding and generation have long been two separate research pillars, despite their strong synergy. This motivates us to develop Show-o, a single unified transformer that can perform both multimodal understanding and generation. Show-o is the first to unify autoregressive and discrete diffusion modeling, flexibly supporting a wide range of vision-language tasks with arbitrary input/output formats, including visual question answering, text-to-image/video generation, and generation of video keyframes with captions, all within a single 1.3B-parameter transformer. Show-o sheds light on building the next-generation multimodal video foundation model and has already sparked many follow-up works.