Visual Perception: A New Model

Summary: A new computational model reproduces the human visual system’s ability to rapidly form a detailed description of a scene from a single image.

Source: MIT

When we open our eyes, we immediately perceive our surroundings in rich detail. Understanding how the brain builds such detailed representations so quickly remains a major question in vision science.

Researchers have long tried to model this rapid, detailed perception with computer vision systems, but most leading models can perform only narrower tasks, such as locating a face or identifying an object against clutter. A team led by cognitive scientists at MIT has developed a new model that more closely mirrors the human visual system’s ability to generate a rich scene interpretation from a single image. The model also offers hypotheses about how the brain performs this computation so rapidly.

“We aimed to explain how perception can be far more than mere semantic labeling of image parts and to explore how we perceive the physical structure of the world,” says Josh Tenenbaum, professor of computational cognitive science and a member of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds, and Machines (CBMM).

The model proposes that the brain, when it receives visual input, executes a sequence of computations that reverse the steps used by computer graphics programs to render a 2D image of a face or object. Known as efficient inverse graphics (EIG), this approach also aligns with neural recordings from face-selective brain regions in nonhuman primates, suggesting the primate visual system may be organized similarly to the model.

The study’s lead author is Ilker Yildirim, formerly a postdoc at MIT and now an assistant professor of psychology at Yale University. The senior authors are Josh Tenenbaum and Winrich Freiwald of Rockefeller University. Mario Belledonne, a Yale graduate student, is also an author. The paper appears in Science Advances.

Inverse graphics

Decades of vision research have revealed how light falling on the retina is transformed into coherent scene representations. That knowledge has guided artificial intelligence advances such as face and object recognition. “Vision is one of the brain’s best-understood functions, and computer vision is among the most successful areas of AI,” Tenenbaum notes. Yet current AI systems still fall short of the richness and speed of human perception.

“Our brains do more than detect and label objects,” Yildirim explains. “We perceive shapes, geometry, surfaces, and textures—we see a richly structured world.”

Hermann von Helmholtz proposed more than a century ago that the brain may create internal generative models and run them in reverse: an internal image generator could synthesize faces and scenes, and reversing that generator would allow the brain to infer the underlying structure that produced the retinal image. The challenge has been explaining how the brain performs this “inverse graphics” process within the roughly 100–200 milliseconds required for online perception.

Previous computational implementations of inverse graphics typically relied on many iterative processing cycles and were too slow to match biological speeds. Neuroscientists suspect the brain achieves rapid perception through a mostly feedforward pass across hierarchically organized neural layers.

The MIT-led team built a deep neural network that models how a layered neural hierarchy could quickly infer a scene’s underlying features—here focused on faces. Unlike standard deep networks trained on labeled categories, this network is trained using a generative model that reflects how the brain might internally represent faces and scenes.
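The contrast between slow iterative inversion and a fast feedforward pass can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: a hypothetical linear `render` function stands in for the graphics program, gradient descent on the rendering error stands in for iterative analysis-by-synthesis, and a small linear map trained on images sampled from `render` stands in for the deep network.

```python
import random

# Hypothetical toy "graphics program": two latents (say, shape and lighting)
# are mapped linearly onto a 4-pixel image. Real renderers are far richer.
BASIS = [[1.0, 0.5], [0.0, 1.0], [0.5, -0.5], [-1.0, 0.25]]


def render(latents):
    """Forward (generative) direction: latents -> image."""
    return [sum(w * z for w, z in zip(row, latents)) for row in BASIS]


def infer_iterative(image, steps=200, lr=0.1):
    """Analysis-by-synthesis: render a guess, compare, correct, repeat."""
    guess = [0.0, 0.0]
    for _ in range(steps):
        residual = [r - p for r, p in zip(render(guess), image)]
        # Gradient of 0.5 * ||render(guess) - image||^2 with respect to guess.
        grad = [sum(BASIS[i][j] * residual[i] for i in range(4)) for j in range(2)]
        guess = [g - lr * d for g, d in zip(guess, grad)]
    return guess  # cost: one render call per step


def train_feedforward(samples=5000, lr=0.1, seed=0):
    """Fit an inverse map on images sampled from the generative model itself."""
    rng = random.Random(seed)
    weights = [[0.0] * 4 for _ in range(2)]
    for _ in range(samples):
        latents = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        image = render(latents)
        predicted = [sum(w * p for w, p in zip(row, image)) for row in weights]
        for j in range(2):
            err = predicted[j] - latents[j]
            weights[j] = [w - lr * err * p for w, p in zip(weights[j], image)]
    return weights


def infer_feedforward(weights, image):
    """Single pass: no rendering happens at inference time."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


WEIGHTS = train_feedforward()
TARGET = [0.3, -0.6]
IMAGE = render(TARGET)
print("iterative:  ", infer_iterative(IMAGE))
print("feedforward:", infer_feedforward(WEIGHTS, IMAGE))
```

In this sketch both routes should recover latents close to `TARGET`, but the iterative route calls `render` once per step for every new image, while the feedforward route pays its cost once during training and then answers in a single pass.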

The network learns to invert the sequence used by face graphics programs. Those programs start with a three-dimensional representation of a face, apply texture, lighting, and viewpoint, and produce a two-dimensional image that can be placed onto a background. The model reverses this: it begins with a 2D image, reconstructs intermediate “2.5D” representations that encode surface shape, curvature, and texture from a viewpoint, and then infers a 3D, viewpoint-invariant representation of the face.
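The difference between the viewpoint-dependent 2.5D stage and the viewpoint-invariant 3D stage can be made concrete with a minimal sketch. The point set, texture values, and function names below are hypothetical, and the viewpoint is simply handed to the inverse step, whereas the model described here has to infer it from the image. The step from a 2D image back to the 2.5D surface, which is the part the network learns, is not attempted.

```python
import math

# Hypothetical toy face: four 3D surface points (x, y, z) with a texture value each.
FACE_POINTS = [(-1.0, 0.0, 0.2), (0.0, 0.0, 1.0), (1.0, 0.0, 0.2), (0.0, 1.0, 0.6)]
FACE_TEXTURE = [0.8, 0.9, 0.8, 0.7]


def rotate_y(point, angle):
    """Rotate a 3D point about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def to_view_centered(points, angle):
    """3D -> 2.5D: express the surface in the camera frame (viewpoint-dependent)."""
    return [rotate_y(p, angle) for p in points]


def render_image(surface, texture, light):
    """2.5D -> 2D: drop depth, keep a lit intensity at each projected (x, y)."""
    return [((x, y), t * light) for (x, y, _), t in zip(surface, texture)]


def to_object_centered(surface, angle):
    """2.5D -> 3D: undo the viewpoint to get a viewpoint-invariant description."""
    return [rotate_y(p, -angle) for p in surface]


# Forward direction, as in a graphics program: 3D -> 2.5D -> 2D.
frontal = to_view_centered(FACE_POINTS, 0.0)
turned = to_view_centered(FACE_POINTS, math.pi / 4)
image = render_image(turned, FACE_TEXTURE, light=1.2)

# Inverse of the last graphics step only: 2.5D -> 3D, with the viewpoint assumed known.
recovered = to_object_centered(turned, math.pi / 4)
```

Here `frontal` and `turned` describe the same face but differ because they are tied to a viewpoint, while `recovered` matches `FACE_POINTS` up to floating-point rounding regardless of the viewing angle. The `image` list carries no depth values, which is why inferring the 2.5D surface from it is the hard part of the problem.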

“This approach provides a systems-level account of face processing in the brain, showing how an image can be transformed, through a 2.5D intermediate stage, into a full 3D representation that encodes shape and texture,” Yildirim says.

Model performance

The researchers compared the model’s internal stages to neural data from macaque monkeys. Earlier work had recorded neuronal activity in face-selective regions in response to multiple faces shown from different viewpoints and had identified three stages of processing. The MIT team found that these three neural stages map onto their model’s top layers: a viewpoint-dependent 2.5D stage, a transitional stage bridging 2.5D to 3D, and a viewpoint-invariant 3D representation.

Figure: The processing path from a real face image to a computer-generated face. A computer model of face recognition reverses the steps of a graphics renderer to infer three-dimensional shape, texture, and lighting from a two-dimensional image. Image credit: MIT.

“Both the quantitative and qualitative response properties of those three neural stages align well with the top three levels of our network,” Tenenbaum says.

The team also evaluated how the model performs on human-like tasks, such as recognizing faces across different viewpoints and under manipulations like texture removal or shape distortion. The new model’s behavior matched human performance more closely than that of state-of-the-art face-recognition systems, indicating it may better capture key aspects of human perception.

Future work will test the inverse-graphics approach on broader image sets and object categories to determine whether the same principles can explain perception of non-face scenes. The researchers also believe that integrating this approach into computer vision could yield AI systems that see more richly and robustly.

“If evidence continues to support these models as biologically plausible, it could encourage computer vision researchers to invest more engineering effort in inverse-graphics approaches,” Tenenbaum says. “The brain remains the gold standard for machines that perceive the world richly and quickly.”

Funding: Center for Brains, Minds, and Machines at MIT; National Science Foundation; National Eye Institute; Office of Naval Research; New York Stem Cell Foundation; Toyota Research Institute; Mitsubishi Electric.

About this neuroscience research article

Source: MIT
Media contact: Sarah McDonnell, MIT
Image source: MIT

Original Research: Open access
“Efficient inverse graphics in biological face processing” — Ilker Yildirim, Mario Belledonne, Winrich Freiwald, and Josh Tenenbaum. Science Advances. DOI: 10.1126/sciadv.aax5979.

Abstract

Vision not only detects and recognizes objects but also performs rich inferences about the underlying scene structure that gives rise to the light patterns we observe. Inverting generative models—known as “analysis-by-synthesis”—offers one solution, but prior mechanistic implementations have typically been too slow for online perception and their mapping to neural circuits has been unclear. Here we present a neurally plausible, efficient inverse graphics model and test it in the domain of face recognition. The model uses a deep neural network trained to invert a three-dimensional face graphics program in a single fast feedforward pass. It accounts for human behavior both qualitatively and quantitatively, including phenomena like the “hollow face” illusion, and it maps onto a specialized face-processing circuit in the primate brain. The model fits behavioral and neural data better than current state-of-the-art computer vision models and offers an interpretable, reverse-engineering account of how the brain converts images into percepts.
