ABSTRACT
Training robust deep video representations has proven to be much more challenging than learning deep image representations. There are two main reasons: videos are large, and annotating video data is hard.
Raw video streams are huge and highly redundant, and the true, interesting signal often drowns in irrelevant data. In this talk, I will show how to train a deep network directly on the compressed video representation (e.g., H.264 or HEVC), which is largely free of this redundancy, rather than on the traditional, highly redundant RGB stream. The compressed representation has a higher information density and makes the network easier to train.
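To make the idea concrete, the sketch below shows one way a network could consume compressed-domain signals (I-frames, motion vectors, and residuals) instead of decoded RGB. This is a minimal PyTorch sketch under assumed tensor shapes and layer sizes; the stream parsing is not shown, and the architecture and names are illustrative rather than the model presented in the talk.

```python
# Minimal sketch: a classifier over compressed-domain inputs.
# Assumption: the bitstream has already been parsed into I-frames,
# motion vectors, and residuals as tensors (shapes are illustrative).
import torch
import torch.nn as nn

class CompressedVideoNet(nn.Module):
    """Classify an action from compressed-domain inputs rather than decoded RGB."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Heavier branch for the full I-frame (3 RGB channels).
        self.iframe_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Lighter branches for motion vectors (dx, dy) and residuals (RGB).
        self.motion_net = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.residual_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64 + 16 + 16, num_classes)

    def forward(self, iframe, motion, residual):
        feats = torch.cat([
            self.iframe_net(iframe),
            self.motion_net(motion),
            self.residual_net(residual),
        ], dim=1)
        return self.classifier(feats)

# Dummy batch: one I-frame plus the motion vectors and residuals of a P-frame.
model = CompressedVideoNet()
logits = model(torch.randn(4, 3, 224, 224),   # I-frame RGB
               torch.randn(4, 2, 224, 224),   # motion vectors
               torch.randn(4, 3, 224, 224))   # residuals
print(logits.shape)  # torch.Size([4, 10])
```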
Like image representations, deep video representations are extremely data-hungry. Unlike images, however, videos are much harder to annotate because of their sheer size. This limits video tasks primarily to global labeling, such as action recognition; most other tasks require too much manual annotation. In the second part of this talk, I will present an alternative: I will show how ground-truth instance segmentation, semantic labels, depth, optical flow, intrinsic image decomposition, and instance tracking can easily be extracted from video games in real time as we play.
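As a rough illustration of why this supervision is nearly free, the sketch below derives instance masks and instance tracks from per-pixel buffers that a game engine renders anyway. It is a minimal sketch under the assumption that instance-ID and depth buffers are already accessible for each frame; the actual capture pipeline (hooking the game's rendering calls) is not shown, and the buffer names and toy tracking scheme are hypothetical.

```python
# Minimal sketch: turning assumed per-frame engine buffers into labels.
import numpy as np

def frame_ground_truth(instance_ids: np.ndarray, depth: np.ndarray):
    """Turn raw engine buffers into per-instance masks plus a ground-truth depth map."""
    masks = {}
    for obj_id in np.unique(instance_ids):
        if obj_id == 0:          # 0 = background in this sketch
            continue
        masks[int(obj_id)] = instance_ids == obj_id
    return masks, depth

def track(prev_masks: dict, cur_masks: dict):
    """Instance tracking is trivial here: engine object IDs persist across frames."""
    return {obj_id: (obj_id in prev_masks) for obj_id in cur_masks}

# Two toy 4x4 "frames": object 7 moves one pixel to the right.
ids_t0 = np.zeros((4, 4), dtype=np.int32); ids_t0[1:3, 0:2] = 7
ids_t1 = np.zeros((4, 4), dtype=np.int32); ids_t1[1:3, 1:3] = 7
depth = np.full((4, 4), 5.0, dtype=np.float32)

masks_t0, _ = frame_ground_truth(ids_t0, depth)
masks_t1, _ = frame_ground_truth(ids_t1, depth)
print(track(masks_t0, masks_t1))   # {7: True} -> instance 7 tracked across frames
```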