Stroop Test Exposes Critical Weakness in LLMs

Summary: A new cognitive evaluation of artificial intelligence reveals a systemic weakness in transformer-based attention. Researchers administered the classic Stroop task to leading large language models (LLMs) — including GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5 — and found that machine attention collapses under increasing sequence length, producing near-total failures when the models must inhibit their dominant text-reading impulse.

Where human cognition sustains top-down control to suppress automatic responses across long sequences, transformer attention shows a dramatic, length-dependent decline in executive control. When models are required to report ink color instead of reading the color word, accuracy drops sharply as the number of items grows, exposing a structural limitation in current LLM architectures.

Key facts

The audit and its goal: Led by Suketu Patel and collaborators, the study compared transformer-based machine attention to human executive attention using the Stroop task — a clinical test in which color words are printed in incongruent colored ink and participants must name the ink color while ignoring the word meaning. The task specifically measures inhibitory control and conflict resolution.
Length-dependent collapse: LLMs can perform well on short, incongruent lists, but their ability to inhibit an automatic reading response deteriorates rapidly as list length increases. Short lists (five items) produced solid accuracy, while longer lists produced catastrophic drops in performance.
Measured degradation in frontier models: The authors report accuracy at specific list lengths. For example, GPT-4o achieved 91% accuracy on five-word incongruent lists, fell to 57% at ten words, and collapsed to 15% at forty words. Claude 3.5 Sonnet remained relatively stable through twenty words but dropped to 24% accuracy at forty words.
Mixed-list breakdown: When lists mixed congruent and incongruent items, accuracy on the incongruent items fell dramatically, in some cases to near zero, indicating that mixed contexts are especially challenging for transformer attention.
Next-generation models affected: The same pattern of length-dependent failure was observed in newer systems, including GPT-5, Claude Opus 4.1, and Gemini 2.5, suggesting a pervasive architectural issue rather than a quirk of older models.
Biological versus synthetic attention: Both humans and LLMs have far more practice reading words than naming colors, but the human brain applies sustained top-down control to suppress automatic word reading across long sequences. The collapse of LLM performance under extended interference highlights a fundamental difference: transformer attention lacks an explicit mechanism for up-regulating control when conflict increases.
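The list conditions described in the key facts above can be illustrated with a short Python sketch. It is only an illustration: the four-color palette, the 50/50 mix in the "mixed" condition, the (word, ink) pair representation, and the function names are all assumptions, since the article does not describe how the study encoded its stimuli.

```python
import random

# Hypothetical palette; the study's actual color set is not given in the article.
COLORS = ["red", "green", "blue", "yellow"]
CONDITIONS = ("congruent", "incongruent", "mixed")


def make_stroop_list(n_items, condition, seed=0):
    """Return n_items (word, ink) pairs for one Stroop list.

    congruent:   the word names its own ink color
    incongruent: the word names a different color than its ink
    mixed:       each item is congruent or incongruent with equal
                 probability (an assumed ratio)
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if condition == "mixed":
            congruent = rng.random() < 0.5
        else:
            congruent = condition == "congruent"
        if congruent:
            word = ink
        else:
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word, ink))
    return items


def correct_answers(items, task):
    """Expected responses: ink names for 'color', the words for 'word'."""
    if task == "color":
        return [ink for _, ink in items]
    if task == "word":
        return [word for word, _ in items]
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    for length in (5, 10, 20, 40):  # list lengths mentioned in the article
        items = make_stroop_list(length, "incongruent", seed=length)
        print(length, items[:3], correct_answers(items, "color")[:3])
```

In an incongruent list the correct color-naming answer never matches the printed word, so a model that falls back on reading words scores zero on those items.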

Source: PNAS Nexus

Overview of findings

The research team used the Stroop paradigm to probe whether transformer attention implements anything comparable to human executive control. In the congruent condition (word and ink color match), models performed well at all tested lengths. In the incongruent condition (word and ink color conflict), models handled short lists similarly to humans but failed to maintain accuracy across longer sequences. Word-reading accuracy remained near-perfect, indicating that the models reverted to their dominant trained behavior, reading words, rather than sustaining the instructed task of color naming.
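One way to quantify the pattern described above is to score color-naming responses separately for congruent and incongruent items and to count how often an error is simply the printed word. The sketch below is a minimal illustration, not the authors' scoring procedure: the (word, ink) pair format, the lowercase color names, and the function name are assumptions.

```python
def score_color_naming(items, responses):
    """Score a color-naming run.

    items:     list of (word, ink) pairs with lowercase color names
    responses: list of model answers, one per item
    Returns (accuracy, intrusions): accuracy maps each condition to the
    fraction correct (None if the list had no such items), and intrusions
    counts incongruent items where the model reported the word instead
    of the ink color.
    """
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    totals = {"congruent": 0, "incongruent": 0}
    correct = {"congruent": 0, "incongruent": 0}
    intrusions = 0
    for (word, ink), response in zip(items, responses):
        answer = response.strip().lower()
        condition = "congruent" if word == ink else "incongruent"
        totals[condition] += 1
        if answer == ink:
            correct[condition] += 1
        elif condition == "incongruent" and answer == word:
            intrusions += 1  # the model read the word instead of naming the ink
    accuracy = {
        c: (correct[c] / totals[c] if totals[c] else None) for c in totals
    }
    return accuracy, intrusions


if __name__ == "__main__":
    # Made-up responses for illustration only; not data from the study.
    items = [("red", "red"), ("red", "blue"), ("green", "yellow"), ("blue", "green")]
    responses = ["red", "red", "yellow", "Blue "]
    print(score_color_naming(items, responses))
```

Counting intrusions separately from other errors distinguishes a model that reverts to word reading from one that is merely answering at random.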

These results suggest that transformer self-attention, as currently implemented in LLMs, lacks the adaptive conflict-resolution and top-down modulation that characterize biological attention. As list length increases, the models do not up-regulate inhibitory control and instead revert to their most strongly reinforced behavior.

Key questions answered

Q: Why does the Stroop task break advanced AI models?

A: The Stroop task specifically measures executive control, the ability to block an automatic response. LLMs are extensively trained to read and predict text. When asked to ignore word meaning and report ink color, this entrenched behavior increasingly dominates as list length grows, and the model defaults to reading words rather than following the instruction to inhibit that response.

Q: How badly did newer systems perform on long lists?

A: Performance degraded sharply. For example, GPT-4o dropped from 91% accuracy at five words to 15% at forty words. In tests that mixed congruent and incongruent items, models such as GPT-5, Claude Opus 4.1, and Gemini 2.5 collapsed to near-zero accuracy for the incongruent entries.
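To put the GPT-4o figures in proportion, the short Python sketch below computes the drop from the five-word baseline in percentage points and as a fraction of that baseline. The accuracies are the ones quoted in this article; the function name and the choice of the five-word list as the baseline are assumptions made for illustration.

```python
# GPT-4o accuracy (%) on incongruent lists by list length, as quoted in the article.
GPT4O_ACCURACY = {5: 91, 10: 57, 40: 15}


def drops_from_baseline(accuracy_by_length, baseline_length=5):
    """Return {length: (drop in percentage points, drop relative to baseline)}."""
    baseline = accuracy_by_length[baseline_length]
    result = {}
    for length, accuracy in accuracy_by_length.items():
        points = baseline - accuracy
        result[length] = (points, points / baseline)
    return result


if __name__ == "__main__":
    for length, (points, relative) in drops_from_baseline(GPT4O_ACCURACY).items():
        print(f"{length:>2} words: -{points} points ({relative:.0%} of baseline lost)")
```

By this measure the forty-word list costs GPT-4o 76 percentage points, roughly 84% of its five-word accuracy.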

Q: What does this tell neuroscientists about human versus machine attention?

A: It demonstrates a structural limitation in transformer attention: unlike human attention, which can sustain top-down inhibition over extended input streams, transformer architectures currently lack mechanisms to adaptively increase control under rising interference. This may explain why synthetic attention struggles to resist its own training biases in complex, long-context scenarios.

Editorial notes

This article was edited by a Neuroscience News editor.
The journal paper was reviewed in full by staff.
Additional context was added by the editorial team.

About this AI reasoning research news

Author: Jin Fan
Source: PNAS Nexus
Contact: Jin Fan – PNAS Nexus
Image: The image is credited to Neuroscience News

Original Research: Closed access. “Deficient executive control in transformer attention” by Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan. PNAS Nexus
DOI: 10.1093/pnasnexus/pgag149

Abstract

Deficient executive control in transformer attention

Transformers in large language models implement a powerful self-attention mechanism that transformed natural language processing, but they do not include an explicit architecture for the executive control of attention found in biological systems. Executive control is essential for resolving conflicts and selecting relevant information when competing computations arise, and it supports adaptive behavior in humans.

To assess the consequences of this absence, the authors applied the color Stroop task, a standard measure of inhibitory control, to transformer-based models. Results showed the expected conflict effect in short incongruent lists, with reduced accuracy compared to congruent lists. However, as list length increased, performance on incongruent trials degraded toward near-total collapse, while congruent accuracy and word-reading performance remained high.

These findings indicate that transformer attention mechanisms are fundamentally limited in conflict resolution across extended contexts and fail to up-regulate control adaptively under rising interference. The authors propose that integrating executive-control-like mechanisms, more analogous to biological attention, may be necessary to advance toward robust, general artificial intelligence.