What Your Brain Instantly Sees That AI Still Misses

Summary: New research shows the human brain automatically recognizes the actions an environment makes possible—such as walking, climbing, or swimming—without conscious effort. Using functional MRI, researchers identified distinct activity patterns in visual brain regions that go beyond simple object or color processing, revealing a neural representation of “affordances,” or possible actions offered by a scene.

When the researchers compared human judgments with a range of AI systems, including vision models and large language models such as GPT-4, the models only partially matched how people judge which actions a scene affords. The findings emphasize the tight coupling of perception and potential action in the brain and suggest directions for AI development inspired by human cognition.

Key facts:

  • Automatic action mapping: The brain encodes possible actions (affordances) from visual scenes even when no action is explicitly requested.
  • Distinct neural signatures: Activity in scene-selective visual cortex reflects what you can do in a place, not only what objects or surfaces are visible.
  • AI gap: Current AI models, even advanced ones, do not fully match human judgments about environmental action opportunities.

Source: University of Amsterdam

How do we instantly know how to move through an unfamiliar scene? When we glance at a mountain trail, a crowded street, or a riverbank, we almost immediately perceive which movements are possible: whether to walk, cycle, swim, or stop. This study examines how the brain represents these possibilities and how those neural representations differ from the ones in artificial models.

The work, led by computational neuroscientist Iris Groen with PhD student Clemens Bartnik and colleagues, examined how people evaluate locomotive affordances and then compared those human judgments and brain responses with the outputs and internal activations of deep neural networks and other AI models.

Measuring scene perception inside an MRI scanner

Participants viewed photographs of indoor and outdoor environments while undergoing functional MRI. For each image, they pressed a button to indicate whether the scene invited actions such as walking, cycling, driving, swimming, boating, or climbing. The researchers recorded the brain activity that accompanied those judgments to determine how perceived action opportunities are represented neurally.

The central question was whether visual processing stops at identifying objects and textures, or whether it also encodes what actions the scene affords. Psychologists call these perceived possibilities “affordances”: for example, seeing a staircase and perceiving it as climbable, or spotting an open field and perceiving it as a place to run.

Distinct processing in human visual cortex

Analysis of the MRI data revealed that specific areas of the scene-selective visual cortex show activation patterns that cannot be explained solely by low-level image features, object identities, or global scene categories. Instead, these patterns relate directly to perceived locomotive affordances.

Importantly, affordance-related activity emerged regardless of the task performed in the scanner, including when participants were not explicitly instructed to think about possible actions. In other words, the brain registers action opportunities automatically as part of visual perception. This provides evidence that affordances are not only a psychological construct but also a measurable property of cortical representation.

Where AI still falls short

The study evaluated a range of AI systems—including object and scene classification networks and large language models—by comparing their outputs and internal feature activations to human behavioral annotations and fMRI patterns. While models trained specifically to recognize actions could approximate certain human judgments, the overall alignment with human perception and neural representations was limited.

Even advanced models such as GPT-4 did not fully capture the human way of linking perception to action. The researchers note that machine models typically operate on visual patterns or textual co-occurrences without the embodied experience humans bring to perception. As a result, AI can identify objects and surfaces but struggles to infer the same set of practical action affordances that people perceive intuitively.
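One common way to compare model activations with brain responses, as described in this section, is representational similarity analysis (RSA). The sketch below is a minimal, pure-Python illustration of that general technique and not the authors' pipeline. The toy numbers, the function names, and the choice of 1 minus Pearson correlation as the dissimilarity measure are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rdm_upper(features):
    """Dissimilarity (1 - Pearson r) for every pair of images.

    `features` holds one response vector per image (voxels or model units).
    Returns the upper triangle of the dissimilarity matrix as a flat list.
    """
    pairs = combinations(range(len(features)), 2)
    return [1.0 - pearson(features[i], features[j]) for i, j in pairs]


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rsa_score(features_a, features_b):
    """Rank-correlate the pairwise dissimilarities of two representations."""
    return spearman(rdm_upper(features_a), rdm_upper(features_b))


# Hypothetical toy data: 4 images, each with a short response vector.
brain_patterns = [        # e.g. voxel responses (made-up values)
    [1.0, 0.2, 0.1],
    [0.9, 0.3, 0.0],
    [0.1, 0.8, 0.9],
    [0.0, 1.0, 0.7],
]
model_features = [        # e.g. unit activations (made-up values)
    [0.8, 0.1, 0.3, 0.2],
    [0.7, 0.2, 0.2, 0.1],
    [0.2, 0.9, 0.6, 0.8],
    [0.1, 0.7, 0.9, 0.6],
]

score = rsa_score(brain_patterns, model_features)
print(round(score, 3))
```

A score near 1 would mean the model and the brain data treat the same image pairs as similar or different; a low score would indicate the limited alignment the article describes. Real analyses use many more images and voxels and usually rely on established libraries rather than hand-written statistics.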

Implications for AI and human-centered design

These results bear on designing safer, more efficient, and more adaptable AI systems. For domains like robotics, autonomous vehicles, and rescue operations, systems must do more than label objects—they must infer which actions are possible or safe in a scene. Understanding how the human brain encodes affordances could inspire models that better integrate perception with action reasoning.

The authors also stress sustainability and accessibility concerns: current AI training often demands massive computational resources available primarily to large companies. Insights from the brain’s efficient, rapid processing could guide the development of AI that is both more economical to train and more robust in real-world interaction tasks.

About this AI research news

Author: Laura Erdtsieck
Contact: Laura Erdtsieck – University of Amsterdam

Original research: Closed access. “Representation of locomotive action affordances in human behavior, brains, and deep neural networks” by Iris Groen et al., published in PNAS (DOI: 10.1073/pnas.2414005122). The study compares human behavioral annotations, fMRI responses in scene-selective visual cortex, and deep neural network activations to show that locomotive action affordances are represented in the human visual system and only partially captured by current computational models.


Abstract (concise)

To navigate the world, we must identify which locomotive actions—walking, swimming, climbing, and so on—are afforded by a visual scene. This study links behavioral annotations, fMRI measurements, and deep neural network activations on the same images to demonstrate that the human visual cortex encodes locomotive affordances in complex scenes. Behavioral clustering shows people group environments into distinct affordance categories across multiple dimensions. Multivoxel representational analyses indicate that these affordance representations are independent of other scene properties such as objects, materials, or global scene categories, and are present regardless of the specific task performed in the scanner. Visual features from standard DNNs correlate more strongly with object representations than with affordance representations. Training networks on affordance labels or using affordance-centered language embeddings improves alignment with human behavior, but none of the tested models fully captures human affordance perception. These findings reveal a neural representation that reflects locomotive action affordances in humans.
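The claim that affordance representations are independent of other scene properties can be illustrated with a partial correlation, which asks how strongly two measures are related once a third is held constant. The sketch below is only an assumption about how such a check might look; the source does not say which analysis the authors used, and the three toy dissimilarity vectors and the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def partial_corr(x, y, z):
    """Correlation between x and y after removing what z explains in both.

    Uses the standard first-order formula
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    Undefined (division by zero) if z is perfectly correlated with x or y.
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical pairwise dissimilarities for the same four image pairs.
brain_rdm = [0.1, 0.2, 0.3, 0.4]        # from fMRI patterns (made up)
affordance_rdm = [0.2, 0.1, 0.4, 0.3]   # from action ratings (made up)
object_rdm = [0.9, 0.1, 0.1, 0.9]       # from object labels (made up)

print(round(pearson(brain_rdm, affordance_rdm), 3))
print(round(partial_corr(brain_rdm, affordance_rdm, object_rdm), 3))
```

In this toy case the object vector is uncorrelated with the other two, so controlling for it leaves the brain-affordance correlation unchanged. If the relation had shrunk toward zero after controlling for objects, that would suggest the apparent affordance signal was really driven by object content.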