Why LLMs Learn Like Humans but Lack Abstract Reasoning

Summary: A new study shows that large language models (LLMs) such as GPT-J produce novel word forms not by applying abstract grammatical rules, but by making analogies to words they have seen during training. When given invented adjectives, the model tended to pick noun forms based on similarity to familiar words, mirroring how people handle unfamiliar language.

Unlike people, however, these models do not consolidate separate occurrences of the same word into a single mental entry. Instead, they appear to rely on many stored examples and pattern-matching across those instances, which helps explain both their fluent outputs and their heavy demand for training data.

Key facts

Analogy over rules: In many cases, LLMs generalize new word forms by analogy to known examples rather than by applying explicit grammatical rules.
No mental dictionary: LLMs treat individual token instances separately rather than forming an abstract, consolidated dictionary entry for each word, unlike humans.
Data-hungry models: This reliance on example-based generalization helps explain why LLMs require vastly more text data than humans to learn similar linguistic patterns.

Source: Oxford University

Overview of the study

Researchers from the University of Oxford and the Allen Institute for AI (Ai2) led a study, published 9 May in the journal PNAS, that investigates how LLMs generalize linguistic patterns. The team focused on GPT-J, an open-source LLM developed by EleutherAI, and compared the model’s behavior to human judgments and to formal cognitive models of language learning.

The LLM behaved as if it had formed a memory trace from every individual example of every word it encountered during training. Credit: Neuroscience News

The central question was whether GPT-J’s generalizations reflect rule-like abstraction or arise from analogy to stored examples. To address this, the researchers examined English derivational morphology, specifically the process by which adjectives become nouns through suffixes such as “-ness” and “-ity.” For many adjectives both suffixes are possible in principle, and which one is actually used depends on the word’s phonological shape and lexical history (e.g., happy → happiness; available → availability). This variability makes nominalization an informative test case for distinguishing rule-based from analogical generalization.

The team generated 200 nonce (made-up) English adjectives, such as cormasive and friquish, that the model would not have encountered in training. They then asked GPT-J to choose the corresponding noun form, -ness or -ity (for example, choosing between cormasivity and cormasiveness). The model’s choices were compared with human responses and with predictions from two cognitive models: one rule-based and one analogical.
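The forced-choice setup described above can be sketched in a few lines of Python. This is only an illustration, not the study's code: the helper names noun_candidates and forced_choice are hypothetical, the spelling rule handles nothing beyond a final silent "e", and the score argument stands in for whatever preference measure is used (for instance, a language model's log-probability for each candidate, which is an assumption here).

```python
from typing import Callable


def noun_candidates(adjective: str) -> dict[str, str]:
    """Build the two competing noun forms for an adjective.

    Simplified spelling: only a final silent "e" is dropped before -ity.
    Real English needs more rules (e.g. available -> availability).
    """
    stem = adjective[:-1] if adjective.endswith("e") else adjective
    return {"ity": stem + "ity", "ness": adjective + "ness"}


def forced_choice(adjective: str, score: Callable[[str], float]) -> str:
    """Return the suffix whose candidate noun gets the higher score.

    `score` is a placeholder for any preference measure over word forms.
    """
    candidates = noun_candidates(adjective)
    return max(candidates, key=lambda suffix: score(candidates[suffix]))


if __name__ == "__main__":
    print(noun_candidates("cormasive"))
    print(noun_candidates("friquish"))
```

Keeping the scorer as a parameter means the same comparison can be run against a language model, a cognitive model, or tallied human responses without changing the surrounding code.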

Across these experiments, GPT-J’s behavior aligned closely with the analogical model and with human-like analogical judgments. For example, friquish tended to yield friquishness because friquish is phonologically similar to words such as selfish, which take -ness. By contrast, cormasive followed patterns associated with word pairs like sensitive → sensitivity, steering the choice toward -ity.
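To make the idea of analogy to stored examples concrete, here is a minimal Python sketch of an exemplar-based chooser. It is not the analogical model used in the study. The tiny lexicon, the function names, and the similarity measure (length of the shared spelling at the end of the word, squared) are all assumptions chosen for illustration, and spelling overlap is only a rough stand-in for phonological similarity.

```python
from collections import defaultdict

# Toy lexicon: real adjectives and the suffix their usual noun form takes.
# Illustrative only; a real model would be fitted to a large corpus.
LEXICON = {
    "selfish": "ness",
    "foolish": "ness",
    "childish": "ness",
    "happy": "ness",
    "sensitive": "ity",
    "productive": "ity",
    "passive": "ity",
    "available": "ity",
}


def shared_ending(a: str, b: str) -> int:
    """Length of the longest letter sequence shared at the end of both words."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def analogical_choice(nonce: str, lexicon: dict[str, str] = LEXICON) -> str:
    """Pick the suffix whose known exemplars look most like the nonce word."""
    support = defaultdict(float)
    for word, suffix in lexicon.items():
        # Squaring the overlap lets close neighbours outweigh distant ones.
        support[suffix] += shared_ending(nonce, word) ** 2
    return max(support, key=support.get)


if __name__ == "__main__":
    for nonce in ("friquish", "cormasive"):
        print(nonce, "->", analogical_choice(nonce))
```

With this toy lexicon, friquish draws all of its support from the -ish words and cormasive from the -ive words, which mirrors the selfish and sensitive examples above. No rule such as "-ive takes -ity" is stated anywhere in the code; the preference falls out of similarity to stored words.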

The researchers also probed nearly 50,000 real English adjectives and found that the model’s outputs matched the statistical frequencies present in its training data with striking precision. This pattern suggests that GPT-J behaves as if it has formed a memory trace for each individual instance it saw during training and consults those traces when producing novel forms: in effect, asking “What does this remind me of?” when confronted with a new adjective.

A crucial difference emerged between human learners and GPT-J. People tend to form a mental lexicon that groups all instances of the same word into a single conceptual entry, making it easy to recognize that nonce forms such as friquish and cormasive are not established English words and to generalize from a compact set of known types. By contrast, GPT-J generalizes over many specific token instances without merging them into abstract dictionary entries, making it more dependent on large volumes of varied examples.
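The contrast between generalizing over word types and over individual token instances can be illustrated with a small Python sketch. The counts below are invented for the example and the function names are hypothetical; the point is only that the two ways of tallying evidence can favor different suffixes for the same set of words.

```python
from collections import Counter

# Hypothetical observations: (adjective, suffix of its noun, token count).
# The counts are made up to show how the two tallies can disagree.
OBSERVATIONS = [
    ("sensitive", "ity", 900),
    ("massive", "ness", 5),
    ("pervasive", "ness", 5),
    ("evasive", "ness", 5),
]


def type_based_votes(observations) -> Counter:
    """One vote per distinct word, as with consolidated dictionary entries."""
    votes = Counter()
    for _word, suffix, _count in observations:
        votes[suffix] += 1
    return votes


def token_based_votes(observations) -> Counter:
    """One vote per occurrence, as if every instance left its own trace."""
    votes = Counter()
    for _word, suffix, count in observations:
        votes[suffix] += count
    return votes


if __name__ == "__main__":
    print("types: ", type_based_votes(OBSERVATIONS).most_common())
    print("tokens:", token_based_votes(OBSERVATIONS).most_common())
```

In this invented data set, three of the four word types take -ness, so a type-based learner leans toward -ness, while a token-based learner is dominated by the single very frequent -ity word. A learner of the second kind is sensitive to how often each form occurs, which fits the frequency sensitivity reported for GPT-J.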

Janet Pierrehumbert, Professor of Language Modelling at Oxford and senior author of the study, remarked that while LLMs generate language impressively, they do not generalize as abstractly as humans do, a characteristic that likely contributes to their need for far more training data. Co-lead author Dr. Valentin Hofmann (Ai2 and University of Washington) emphasized the value of combining linguistics and AI approaches to better reveal what is happening inside LLMs and to guide improvements toward more efficient and explainable systems.

The research team also included collaborators from LMU Munich and Carnegie Mellon University. Their findings suggest that analogical processes play a larger role in LLM linguistic generalization than had previously been appreciated, with implications for model design, data efficiency, and interpretability.

About this AI and learning research news

Author: Philippa Sims
Source: Oxford University
Contact: Philippa Sims – Oxford University
Image: The image is credited to Neuroscience News

Original research (open access): “Derivational Morphology Reveals Analogical Generalization in Large Language Models” by Janet Pierrehumbert et al., published in PNAS.

Abstract

Derivational Morphology Reveals Analogical Generalization in Large Language Models

What mechanisms underlie linguistic generalization in large language models (LLMs)? Prior work has focused largely on regular phenomena where rule-based and analogical accounts make similar predictions. This study instead examines variable patterns in English adjective nominalization, a domain where rule and analogy diverge.

Fitting cognitive models that instantiate rule-based and analogical learning to GPT-J’s training data, the researchers compared their predictions on nonce adjectives with GPT-J’s actual outputs. For regular nominalizations both models perform similarly, but for variable cases the analogical model matches GPT-J far better. Moreover, GPT-J is sensitive to the frequencies of individual word forms in its training data, a behavior consistent with analogy but not with rule-based abstraction.

These results argue against a primarily rule-based account of GPT-J’s generalization on adjective nominalization and point instead to analogy as a key mechanism. Overall, the study highlights the importance of analogical processes in LLMs’ linguistic behavior and suggests directions for making models more data-efficient and interpretable.