Modeling common-sense scene understanding with probabilistic programs

J Tenenbaum
Massachusetts Institute of Technology, Cambridge, MA, United States
To see is, famously, to "know what is where by looking". Yet to see is also to know what will happen, what can be done, and what is being done -- to detect not only objects and their locations, but the physical dynamics governing how objects in the scene interact with each other and how agents can act on them, and the psychological dynamics governing how intentional agents in the scene interact with these objects and each other to achieve their goals. I will talk about recent efforts to capture these core aspects of human common-sense scene understanding in computational models that can be compared with the judgments of both adults and young children in precise quantitative experiments, and used for building more human-like machine vision systems. These models of intuitive physics and intuitive psychology take the form of "probabilistic programs": probabilistic generative models defined not over graphs, as in many current machine learning and vision models, but over programs whose execution traces describe the causal processes giving rise to the behavior of physical objects and intentional agents. Common-sense physical and psychological scene understanding can then be characterized as approximate Bayesian inference over these probabilistic programs. Specifically, we embed several standard algorithms -- programs for fast approximate graphics rendering from 3D scene descriptions, fast approximate physical simulation of rigid body dynamics, and optimal control of rational agents (including state estimation and motion planning) -- inside a Monte Carlo inference framework, which is capable of inferring inputs to these programs from observed partial outputs.
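The idea of wrapping a simulator in Monte Carlo inference can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the work described: a one-dimensional sliding block stands in for a rigid-body physics engine, the latent attribute is a friction coefficient, the names `simulate_slide`, `log_likelihood`, and `infer_friction` are hypothetical, and likelihood-weighted importance sampling is just one simple Monte Carlo scheme among many.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2


def simulate_slide(mu, v0=3.0, dt=0.002):
    """Toy stand-in for a physics engine (illustrative assumption).

    Euler-integrates a block sliding with initial speed v0 under
    friction coefficient mu and returns the distance travelled
    before it stops.
    """
    x, v = 0.0, v0
    while v > 0.0:
        x += v * dt
        v -= mu * G * dt
    return x


def log_likelihood(observed, predicted, sigma=0.05):
    """Gaussian observation noise, up to an additive constant."""
    return -0.5 * ((observed - predicted) / sigma) ** 2


def infer_friction(observed, n_samples=2000, seed=0):
    """Estimate the posterior mean of mu given an observed distance.

    Draws candidate inputs from a uniform prior, runs the simulator
    on each one, and weights each candidate by how well its simulated
    output matches the observation.
    """
    rng = random.Random(seed)
    samples = [rng.uniform(0.05, 1.0) for _ in range(n_samples)]
    log_w = [log_likelihood(observed, simulate_slide(mu)) for mu in samples]
    max_log_w = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(weights)
    return sum(w * mu for w, mu in zip(weights, samples)) / total


if __name__ == "__main__":
    observed_distance = simulate_slide(0.3)  # pretend this was measured
    print(infer_friction(observed_distance))
```

The simulator runs only in the forward direction; inference about its input comes entirely from sampling candidate inputs and scoring their simulated outputs against the observation. Swapping in a renderer or a planner for `simulate_slide` would leave the inference loop unchanged, which is the sense in which such programs can be embedded in a common framework.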
We show that this approach can solve a wide range of problems, including inferring scene structure from images, predicting physical dynamics and inferring latent physical attributes from static images or short movies, and reasoning about the goals and beliefs of agents from observations of short action traces. We compare these solutions quantitatively with human judgments, and with the predictions of a range of alternative models. How these models might be implemented in neural circuits remains an important and challenging open question. Time permitting, I will speculate briefly on how it might be addressed. This talk will cover joint work with Peter Battaglia, Jess Hamrick, Chris Baker, Tomer Ullman, Tobi Gerstenberg, Kevin Smith, Ed Vul, Eyal Dechter, Vikash Mansinghka, Tejas Kulkarni, and Tao Gao.
