How GPT-3 Performs on Cognitive Tests

Summary: Researchers evaluated the cognitive capabilities of the large language model GPT-3 and found it matches human performance in some tasks but lags in others, likely because it lacks real-world interaction and embodied experience.

Source: Max Planck Institute

Researchers at the Max Planck Institute for Biological Cybernetics in Tübingen assessed the general intelligence of GPT-3, a widely used large language model.

Applying established methods from cognitive psychology, the team tested GPT-3 on abilities such as decision-making, targeted information search, causal reasoning, and the capacity to revise initial intuitions. They compared the model’s responses to human data to evaluate both accuracy and the types of errors made.

The results reveal a nuanced picture: GPT-3 performs on par with humans in several vignette-based decision tasks, yet it falls short in tasks that require causal inference or active exploration—domains likely dependent on interacting with the physical world.

Large neural networks like GPT-3 are trained on massive text corpora and excel at producing fluent, human-like text. Beyond creative writing, GPT-3 can solve math problems and write code, which raises the question of whether its internal computations reflect human-like cognition.

The Linda problem: to err is not only human

To probe human-like reasoning, Marcel Binz and Eric Schulz presented GPT-3 with classic cognitive tasks. One example is the well-known Linda problem. Subjects read a description of Linda, a young woman concerned with social justice and opposed to nuclear power, and must judge which of two statements is more probable: that she is a bank teller, or that she is a bank teller who is also active in the feminist movement.

Human participants commonly choose the second option despite its lower probability, a phenomenon known as the conjunction fallacy. GPT-3 reproduced the same intuitive error, preferring the more representative but less probable option. This suggests that the model mirrors typical human response patterns learned from text, rather than applying formal probabilistic reasoning.
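The probabilistic point behind the conjunction fallacy can be checked with a few lines of arithmetic. The sketch below uses made-up probabilities, which are assumptions for illustration and not figures from the study, to show that "bank teller and feminist" can never be more probable than "bank teller" alone.

```python
# Illustrative numbers only: these probabilities are assumptions, not data from the study.
p_teller = 0.05                 # P(Linda is a bank teller)
p_feminist_given_teller = 0.90  # P(active feminist | bank teller), deliberately set high

# Product rule: P(A and B) = P(A) * P(B | A)
p_teller_and_feminist = p_teller * p_feminist_given_teller

# Because P(B | A) can never exceed 1, the conjunction cannot exceed P(A).
assert p_teller_and_feminist <= p_teller

print(f"P(teller)              = {p_teller:.3f}")
print(f"P(teller and feminist) = {p_teller_and_feminist:.3f}")
```

However representative the feminist description feels, the second option stays at or below the first, which is why choosing it counts as an error.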

Active interaction as part of the human condition

Lead author Binz notes that GPT-3’s response could reflect familiarity with the specific task and its typical answers: the model learns statistical patterns of language and human responses from training data. To distinguish memorized solutions from genuine generalization, the researchers designed new tasks that preserved the structure of the original problems but varied surface details.

Image caption: Neural networks can learn to respond to input given in natural language and can themselves generate a wide variety of texts. The image, which shows the outline of a head, is in the public domain.

Across these tasks, GPT-3 showed strong performance on vignette-based decision tasks, outperformed humans in a multi-armed bandit task, and displayed signatures of model-based reinforcement learning. However, small perturbations to problem wording could drastically reduce its accuracy, and GPT-3 exhibited no consistent signs of directed, purposeful exploration when searching for information.
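For readers unfamiliar with the terms, a multi-armed bandit asks an agent to choose repeatedly among options with unknown payoffs, and "directed exploration" means deliberately trying options whose value is still uncertain. The sketch below is a generic textbook upper-confidence-bound (UCB) rule, not the analysis used in the study; the function name `run_ucb_bandit`, the reward distributions, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def run_ucb_bandit(true_means, n_trials=200, bonus=2.0, seed=0):
    """Play a Gaussian bandit using an upper-confidence-bound (UCB) rule."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm has been pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    choices = []
    for t in range(1, n_trials + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # Directed exploration: the bonus is larger for rarely tried arms.
            scores = [
                estimates[a] + bonus * math.sqrt(math.log(t) / counts[a])
                for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda a: scores[a])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payoff with unit variance
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        choices.append(arm)
    return choices, estimates

# Hypothetical two-armed task: arm 1 pays far more on average than arm 0.
choices, estimates = run_ucb_bandit([0.0, 5.0])
print("share of pulls on the better arm:", choices.count(1) / len(choices))
```

An agent of this kind keeps sampling the less familiar arm until its uncertainty bonus shrinks. That systematic, uncertainty-driven pattern is what the researchers did not find consistently in GPT-3's choices.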

Most strikingly, the model struggled with causal reasoning tasks that require understanding interventions and outcomes beyond correlations present in text. The authors argue this shortfall stems from GPT-3’s passive training regime: it ingests patterns in written language but does not actively engage with a changing environment, which limits its ability to form causal models like those humans develop through interaction and feedback.
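The gap between correlation and intervention can be made concrete with a toy simulation. The sketch below is not the task used in the study. It assumes a hypothetical model in which a hidden cause `z` drives both `x` and `y`, and every name and probability in it is made up for illustration.

```python
import random

def simulate(n=20000, intervene=None, seed=0):
    """Sample (x, y) pairs from a toy model where a hidden cause z drives both x and y."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.random() < 0.5                      # hidden common cause
        if intervene is None:
            x = rng.random() < (0.9 if z else 0.1)  # observation: x depends on z
        else:
            x = intervene                           # intervention: x is set from outside
        y = rng.random() < (0.8 if z else 0.2)      # y depends only on z, never on x
        samples.append((x, y))
    return samples

def p_y_given_x(samples, x_value):
    """Fraction of samples with y true among those where x equals x_value."""
    ys = [y for x, y in samples if x == x_value]
    return sum(ys) / len(ys)

# Passive observation: x and y look strongly related.
observed = simulate(seed=0)
obs_gap = p_y_given_x(observed, True) - p_y_given_x(observed, False)

# Active intervention: forcing x on or off leaves y essentially unchanged.
do_gap = (p_y_given_x(simulate(intervene=True, seed=1), True)
          - p_y_given_x(simulate(intervene=False, seed=2), False))

print(f"observed difference in P(y):       {obs_gap:.2f}")
print(f"difference in P(y) under intervention: {do_gap:.2f}")
```

In this toy model, observation alone suggests that `x` predicts `y`, yet setting `x` by hand changes almost nothing. Telling these two situations apart is the kind of distinction that acting on an environment makes easier to learn.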

The study therefore positions GPT-3 as a system that captures many patterns of human language use and decision-making but lacks some core aspects of cognition tied to embodied and interactive experience. Its errors often mirror human reasoning biases, yet it also fails where real-world interaction and experimentation would be essential.

The authors suggest a path forward: integrating language models with interactive learning, whether through user interactions, simulated environments, or systems that can act and observe consequences, may help future models approximate the fuller complexity of human cognition.

About this artificial intelligence research news

Author: Daniel Fleiter
Contact: Daniel Fleiter – Max Planck Institute
Image: The image is in the public domain

Original Research: Closed access. “Using cognitive psychology to understand GPT-3” by Marcel Binz et al. PNAS

Abstract

Using cognitive psychology to understand GPT-3

The study applies cognitive psychology methods to analyze GPT-3’s behavior across canonical experiments. The model performs impressively on many vignette-based tasks, makes reasonable decisions from descriptions, and shows strengths in some reinforcement learning tests. Yet it is sensitive to minor task changes, lacks evidence of directed exploration, and fails in a causal reasoning task. Together, these findings deepen our understanding of current large language models and motivate further research using cognitive tools to study increasingly capable but opaque artificial agents.