In recent years, the rise of Large Language Models (LLMs) with advanced general capabilities has paved the way towards building language-guided agents that can perform complex, multi-step tasks on behalf of users, much like human assistants. Building agents that can perceive, plan, and act autonomously has long been a central goal of artificial intelligence research. In this talk, I will introduce multimodal AI agents capable of planning, reasoning, and executing actions on the web: agents that can not only comprehend textual information but also effectively navigate and interact with visual environments. I will then present an inference-time search algorithm that allows agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space and is complementary to most existing state-of-the-art agents. Finally, I will introduce VisualWebArena, a novel framework for evaluating multimodal autonomous language agents, and offer insights towards building stronger autonomous agents for both digital and physical environments.
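To make the best-first tree search idea concrete, here is a minimal, generic sketch of how such a search over an interactive environment might look. It is not the speaker's actual implementation or the VisualWebArena API; all names (`env`, `propose_actions`, `value_fn`, `budget`, `branching`) are illustrative assumptions.

```python
# A minimal, generic sketch of best-first tree search over an interactive
# environment. All interfaces here are assumptions for illustration, not
# the speaker's actual code.
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Node:
    neg_value: float                      # heapq is a min-heap, so store -value
    tiebreak: int                         # insertion counter to break ties
    state: object = field(compare=False)  # environment observation/state
    actions: tuple = field(compare=False) # action sequence that reached this state


def best_first_search(env, propose_actions, value_fn, budget=20, branching=5):
    """Explore the environment by always expanding the most promising node.

    env             -- assumed interface: env.reset() -> state, env.step(action) -> state.
                       Real web environments may require explicit backtracking
                       rather than cheap resets.
    propose_actions -- agent policy returning candidate actions for a state.
    value_fn        -- heuristic scoring how close a state is to task success.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares raw states
    root = env.reset()
    frontier = [Node(-value_fn(root), next(counter), root, ())]
    best = frontier[0]

    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)       # most promising unexpanded node
        if -node.neg_value > -best.neg_value:
            best = node
        for action in propose_actions(node.state)[:branching]:
            # Replay the action prefix, then try the new action in the real env.
            env.reset()
            for a in node.actions:
                env.step(a)
            next_state = env.step(action)
            score = value_fn(next_state)
            heapq.heappush(
                frontier,
                Node(-score, next(counter), next_state, node.actions + (action,)),
            )
    return best.actions  # most promising action sequence found within the budget
```

Because the search expands whichever frontier node the value function currently rates highest, it can layer on top of an existing agent policy (which only needs to supply candidate actions), which is one way to read the claim that the approach is complementary to most existing state-of-the-art agents.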