This is a virtual event; attendance is via Zoom.
Foundation models that connect vision and language have recently shown great promise for a wide array of tasks such as text-to-image generation. Significant attention has been devoted to utilizing the visual representations learned by these powerful vision and language models. In this talk, I will present an ongoing line of research that focuses on the other direction, aiming to understand what knowledge language models acquire through exposure to images during pretraining. We first consider in-distribution text and demonstrate how multimodally trained text encoders, such as that of CLIP, outperform models trained in a unimodal vacuum, such as BERT, on tasks that require implicit visual reasoning. Expanding to out-of-distribution text, we address a phenomenon known as sound symbolism, the non-trivial correlation between particular sounds and meanings observed across languages, and demonstrate its presence in vision and language models such as CLIP and Stable Diffusion. Our work provides new angles for understanding what these vision and language foundation models learn, offering principled guidelines for designing models for tasks that involve visual reasoning.
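For readers curious about the kind of comparison described above, the sketch below shows one way to embed the same sentence with CLIP's text encoder and with BERT using the Hugging Face `transformers` library, as a starting point for probing visual knowledge in text encoders. The model checkpoints, the probe sentence, and the probing setup are illustrative assumptions, not the speaker's actual experimental design.

```python
# Minimal sketch: embed the same sentence with CLIP's text encoder (multimodally
# trained) and with BERT (text-only), then compare downstream probes on top of
# each embedding. Models and sentence are illustrative, not the talk's setup.
import torch
from transformers import CLIPTokenizer, CLIPTextModel, BertTokenizer, BertModel

sentence = "The sky at noon on a clear day"  # implicitly evokes a color (blue)

# CLIP text encoder
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_emb = clip_txt(**clip_tok(sentence, return_tensors="pt")).pooler_output

# BERT text encoder
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_emb = bert(**bert_tok(sentence, return_tensors="pt")).pooler_output

# A lightweight probe (e.g., a linear classifier predicting the implied visual
# attribute) would then be trained on each embedding to compare how much visual
# knowledge each text encoder captures.
print(clip_emb.shape, bert_emb.shape)
```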