ABSTRACT
In this talk, I will discuss how to learn visual representations and common sense knowledge without any manual supervision. First, I will discuss how ConvNets can be trained in a completely unsupervised manner using auxiliary tasks. Specifically, I will demonstrate how spatial context in images and viewpoint changes in videos can be used to train visual representations. Then, I will introduce NEIL (Never Ending Image Learner), a computer program that runs 24×7 to automatically build visual detectors and common sense knowledge from web data. NEIL is an attempt to develop a large and rich visual knowledge base with minimal human labeling effort. Every day, NEIL scans through images of our mundane world and, little by little, learns common sense relationships about it. For example, with no input from humans, NEIL can tell you that trading floors are crowded and babies have eyes. In eight months, NEIL has analyzed more than 25 million images, generated approximately 4 million annotations (bounding boxes and segments), learned models for 7,500 concepts, and discovered more than 20,000 common sense relationships. Finally, in an effort to diversify the knowledge base, I will briefly discuss how NEIL is being extended to a physical robot that learns knowledge about actions.