Abstract: The human visual system is remarkably efficient at perceiving and understanding the visual world. Its abilities include object recognition, scene classification, image segmentation, motion analysis, and many other tasks. Computer vision research has come a long way towards solving these high-level visual recognition tasks. In this talk, I will focus on two topics. First, I will discuss a progression of recent research projects in our lab towards high-level image understanding beyond isolated objects. Using a probabilistic learning and recognition framework, we introduce a new algorithm that is capable of performing simultaneous object segmentation, scene annotation, and event classification. We show that learning can be done in a fully automatic way by using Flickr images and their highly noisy user tags. Second, we argue that many high-level visual recognition tasks involve the understanding of a pivotal object: humans. In addition to classifying human actions, we present an automatic detection and extraction algorithm for carving out humans performing arbitrary motions in YouTube videos. We also briefly discuss an activity recognition algorithm based on the understanding of human-object interactions, such as humans playing musical instruments. Finally, if time allows, I will introduce the newly released ImageNet database, a freely accessible image ontology containing millions of human-labeled, full-resolution images organized according to the WordNet hierarchy.