Breakthrough AI Passes the Turing Test for the First Time

Summary: A landmark cognitive science study provides the first rigorous empirical evidence that a modern artificial intelligence can pass the Turing test. Using a randomized, controlled design based on Alan Turing’s 1950 framework, researchers evaluated whether current large language models (LLMs) can mimic human conversation convincingly enough that people cannot reliably distinguish them from real humans.

When configured with tailored “persona” prompts, advanced systems such as GPT-4.5 were judged to be human 73% of the time. In other words, interrogators picked the model as the human more often than they picked the real person it was paired with, a result that shifts how we think about machine intelligence and social behavior.

Key Facts

Historic milestone: This work is the first rigorous demonstration, using the standard three-party Turing test framework, that an AI system can be judged human as often as, or more often than, the real people it is compared against.
Persona prompting matters: Prompt engineering was crucial. With specific persona instructions that encouraged fallibility, conversational tone, and humor, GPT-4.5 reached a 73% human deception rate. Without those instructions its rate fell to 36%.
Open-source performance: Meta’s LLaMa-3.1-405B, when given the same persona prompt, achieved a 56% human rating—statistically indistinguishable from real humans in this setup.
Older systems lag: Classic rule-based chatbots and earlier model generations performed poorly. ELIZA and GPT-4o were judged human only about 23% and 21% of the time, respectively.
Winning by being humanlike: Coauthor Ben Bergen emphasized that models succeeded not by superior knowledge but by imitating human imperfections: small mistakes, conversational directness, and relatable responses.
Societal risks: Because the test involved extended five- and 15-minute interactions, results raise urgent concerns about online deception, social-engineering scams, and automated political persuasion.

Source: UCSD

A new study from the University of California San Diego reports the first empirical evidence that a contemporary AI can pass a standard Turing test — a benchmark that asks whether a machine can imitate human conversation closely enough to be mistaken for a person.

In controlled experiments, many people could not reliably tell advanced LLMs apart from humans when those models were prompted to adopt humanlike personas.

Image caption: Data published in the Proceedings of the National Academy of Sciences show that state-of-the-art language models given customized human persona prompts were judged human up to 73% of the time in a standard three-party Turing test. Credit: Neuroscience News

Published in the Proceedings of the National Academy of Sciences, the study is the first to apply Turing’s original three-party conversational test to modern LLMs in a preregistered, randomized, and controlled design. It is also the first to report models being judged human as often as actual people in this framework.

“When given the right prompts, advanced LLMs can adopt tone, humor, directness, and fallibility in ways that make them appear human,” said Cameron Jones, the study’s corresponding author, who conducted the research as a doctoral student in Cognitive Science at UC San Diego.

The experimental task placed one participant (the interrogator) in simultaneous text chats with two other parties: one real human and one AI system. After a short conversation, the interrogator decided which of the two partners was human.
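As a rough illustration of how such a forced-choice round can be scored, the sketch below tallies how often the AI, rather than the real person, was named as the human. The `Trial` record, the system labels, and the six example rounds are all hypothetical and are not taken from the study's data or code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One three-party round (hypothetical record layout, not the study's)."""
    system: str      # label of the AI system acting as the machine witness
    picked_ai: bool  # True if the interrogator named the AI as "the human"


def ai_win_rate(trials: list[Trial], system: str) -> float:
    """Share of rounds in which the AI was judged to be the human."""
    relevant = [t for t in trials if t.system == system]
    if not relevant:
        raise ValueError(f"no trials recorded for {system!r}")
    return sum(t.picked_ai for t in relevant) / len(relevant)


def human_pick_rate(trials: list[Trial], system: str) -> float:
    """The interrogator must pick exactly one partner, so the real human
    is chosen in every round where the AI is not."""
    return 1.0 - ai_win_rate(trials, system)


# Made-up rounds for illustration only.
trials = [
    Trial("model_a", True),
    Trial("model_a", True),
    Trial("model_a", False),
    Trial("model_a", True),
    Trial("model_b", False),
    Trial("model_b", True),
]

for name in ("model_a", "model_b"):
    print(name, ai_win_rate(trials, name), human_pick_rate(trials, name))
```

Because every round forces a choice between exactly two partners, random guessing would put the AI's rate at 50%. A rate above that line means the real person was picked less than half the time.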

Researchers ran these controlled trials across two independent groups — UC San Diego undergraduates and a broader online sample — and tested four systems: GPT-4.5 and LLaMa-3.1-405B as contemporary models, plus earlier baselines GPT-4o and ELIZA.

With persona prompts instructing the models to act like ordinary humans, GPT-4.5 was selected as human 73% of the time, significantly more often than the real human conversational partners. LLaMa-3.1-405B reached a 56% human rating under the same conditions, statistically similar to the human baseline. By contrast, ELIZA and GPT-4o were mistaken for humans only about 23% and 21% of the time, respectively.
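To see why 73% reads as "above chance" while 56% reads as "similar to chance," it can help to compare each rate against the 50% expected from random guessing. The sketch below runs an exact two-sided binomial test under a loud assumption: it pretends each system was judged in 100 rounds. That sample size is hypothetical, chosen only so the reported percentages map onto whole counts. It is not the study's sample size, and the test is not necessarily the analysis the authors used.

```python
from math import comb


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n rounds,
    tested against chance (success probability 0.5)."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2**n  # P(X <= k)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2**n  # P(X >= k)
    # The null distribution is symmetric at 0.5, so double the smaller tail.
    return min(1.0, 2 * min(lower, upper))


N_ROUNDS = 100  # HYPOTHETICAL sample size, not from the study

# Counts are the article's reported percentages applied to N_ROUNDS.
picked_as_human = {
    "GPT-4.5 (persona)": 73,
    "LLaMa-3.1-405B (persona)": 56,
    "ELIZA": 23,
    "GPT-4o": 21,
}

for system, k in picked_as_human.items():
    p = binom_two_sided_p(k, N_ROUNDS)
    verdict = "differs from chance" if p < 0.05 else "consistent with chance"
    print(f"{system}: {k}/{N_ROUNDS} picked as human, p = {p:.3g} ({verdict})")
```

Under this assumed sample size, 73 of 100 lands well above chance, 56 of 100 is consistent with guessing, and the two baselines fall well below chance. The verdict for a borderline rate such as 56% depends on the number of rounds, so the authors' own analysis remains the authoritative one.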

76 years after its inception, the Turing Test takes on new meaning

The authors argue the results call for reconsidering what the Turing test measures. “Originally the Turing test asked whether machines could rival human intelligence,” said Ben Bergen, a cognitive science professor and coauthor. “Today, machines can often answer questions faster and more accurately than people, so the test increasingly probes humanlikeness — the social and stylistic cues we associate with being human.”

In the persona condition, the models received a prompt that specified a conversational character and style. Bergen noted the models’ success came from mimicking human flaws and social behavior rather than demonstrating superior reasoning or raw problem-solving ability.

Prompts mattered — dramatically

Prompting made a substantial difference. Without persona instructions, GPT-4.5’s selection rate dropped to 36% and LLaMa-3.1’s to 38%. Baseline systems continued to perform poorly without persona guidance. These results suggest that while LLMs can behave convincingly human, they often need explicit direction to adopt those behaviors.

“The models can appear humanlike when told how to, but they don’t always infer that behavior on their own,” Bergen said.

Why it matters: trust, deception and the rise of “counterfeit people”

The findings have practical consequences for online trust. Because models can sustain humanlike behavior over extended five- and 15-minute conversations, users should be cautious when interacting with unknown parties online.

“It’s relatively easy to prompt these systems to be indistinguishable from people,” said Jones, now an assistant professor of Psychology at Stony Brook University. “We need greater awareness and safeguards, because people are likely to overestimate whether a stranger online is really human.”

Both authors warn about malicious uses: automated scams, targeted persuasion, and social-engineering attacks become more effective when malicious actors can deploy believable conversational agents at scale.

The research team hopes these findings inform public discussion and policy on how to detect and regulate deceptive uses of conversational AI.

A live, head-to-head test of human vs. machine

To run the trials, researchers created a messaging-style interface in which interrogators asked questions of two chat partners side-by-side. After five minutes (and in a replication, after 15 minutes), interrogators chose which partner they believed to be human.

Nearly 500 people participated across the studies, with samples from UC San Diego’s undergraduate participant pool and a broader online panel. UC San Diego participants performed marginally better, possibly because shared experiences allowed them to probe for local or campus-specific knowledge.

A version of the study’s Turing test interface is publicly accessible through the authors’ demonstration site.

Key Questions Answered:

Q: If AI can already solve very difficult problems, why is passing a text-based Turing test still significant?

A: Because the Turing test measures humanlikeness—how well a system can weave social cues, humor, empathy, and imperfections into a conversation. While many AI systems excel at information retrieval and calculation, passing this test shows they can convincingly emulate the social behaviors that make interactions feel human.

Q: How did a machine get judged “more human” than an actual person?

A: The models were prompted to adopt a human persona and to include plausible mistakes, hesitation, and casual humor. Real human interlocutors often type awkwardly or falter under pressure; the engineered imperfections in the models matched those human characteristics and led interrogators to mistake them for real people.

Q: What are the dangers if an AI can convincingly lie for 15 minutes?

A: If an LLM can sustain a believable human persona for extended interactions, it can be weaponized for large-scale deception: extracting sensitive information, manipulating opinions, or promoting fraudulent products. The findings underscore the need for detection tools and policies to reduce such risks.

Editorial Notes:

This article was edited by a Neuroscience News editor.
Journal paper reviewed in full.
Additional context added by staff.

About this AI research news

Author: Christine Clark
Source: UCSD
Contact: Christine Clark – UCSD
Image: Credit to Neuroscience News

Original Research: Open access.
“Large Language Models Pass a Standard Three-Party Turing Test” by Cameron Jones and Ben Bergen. PNAS
DOI: 10.1073/pnas.2524472123

Abstract

Large Language Models Pass a Standard Three-Party Turing Test

The Turing test has long been used to probe machine intelligence and to study how people distinguish humans from machines. We evaluated four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomized, controlled, preregistered three-party Turing tests run on independent populations.

Participants held five-minute conversations simultaneously with another human and one of the systems, then judged which partner was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged human 73% of the time, significantly more often than the actual human participant. LLaMa-3.1-405B reached a 56% human rating under the same prompts, not significantly different from human performance. Without persona prompts, these models performed significantly worse (36% for GPT-4.5 and 38% for LLaMa-3.1-405B) and did not consistently outperform baseline systems ELIZA and GPT-4o (23% and 21%).

A replication with 15-minute interactions produced similar results: two persona-prompted models achieved pass rates of 56% and 59%. Interrogators based their judgments more on stylistic and socio-emotional cues than on classic measures of intelligence. These findings have implications for how we interpret the capabilities of large language models and for the social consequences of increasingly humanlike AI.