Breakthrough: Advanced AI Passes Turing Test for First Time

Summary: A landmark cognitive science study provides the first rigorous empirical evidence that modern artificial intelligence can pass the classic Turing test. In randomized, controlled experiments that faithfully applied Alan Turing’s 1950 three-party conversation framework, researchers show that contemporary large language models (LLMs) can imitate human conversational behavior so convincingly that people often cannot tell them apart from real humans.

When given carefully designed “persona” prompts instructing it to adopt a human-like tone, fallibility, and humor, GPT-4.5 was judged to be human in 73% of trials. That is more often than the real human participants in the same tests were picked as human, a result that challenges long-standing assumptions about how machine intelligence should be evaluated.

Key facts

First rigorous pass of the classical Turing framework: This is the first randomized, preregistered study to show an AI system being judged human at least as often as real people under the original Turing-style three-party design.
Persona prompting is decisive: With explicit persona prompts, GPT-4.5 achieved a 73% human-deception rate; without those prompts its rate dropped to 36%.
Open-source model parity: Meta’s LLaMa-3.1-405B, when prompted the same way, was judged human 56% of the time—statistically indistinguishable from the actual human partners.
Older systems lag: Legacy models and rule-based chatbots performed poorly by comparison. ELIZA and GPT-4o were judged human only about 23% and 21% of the time, respectively.
Humanlike flaws won the test: Rather than impressing with superior knowledge or calculation, the successful models echoed human conversational habits—hesitations, small errors, humor, and directness—that made them seem authentic.
Implications for online trust: Because these results emerged from conversations lasting five and 15 minutes, the findings raise urgent concerns about online deception, social engineering, and automated persuasion.

Source: UCSD

Research overview

A team at the University of California San Diego tested four systems—ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5—across randomized, controlled experiments with independent participant samples. In each session, an interrogator held simultaneous text chats with two partners: one was a human and the other was one of the systems. After a fixed interval, the interrogator chose which partner they believed to be human.

Image: Two computer-generated brains. Data published in the Proceedings of the National Academy of Sciences show that state-of-the-art language models given customized human persona prompts can be judged human 73% of the time, clearing Alan Turing’s historical behavioral benchmark. Credit: Neuroscience News

Across nearly 500 participants recruited from UC San Diego and broader online panels, GPT-4.5 with persona prompting was identified as human 73% of the time, significantly more often than the actual human interlocutors. LLaMa-3.1-405B, given the same persona prompt, achieved a 56% human rating. Without persona instructions, the same models performed notably worse (36% for GPT-4.5 and 38% for LLaMa-3.1-405B) and did not consistently outperform baseline systems.
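To make the phrase "significantly more often" concrete: in a three-party test the interrogator picks one of two partners, so a system that cannot be told apart from a person should be chosen about 50% of the time. The sketch below runs an exact two-sided binomial test against that 50% baseline. The article does not report per-condition trial counts, so the figure of 100 trials is a hypothetical placeholder, and the function names are illustrative. This is not necessarily the analysis the authors used.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    observed = binomial_pmf(k, n, p)
    # Small relative tolerance so equally likely outcomes (ties)
    # are not dropped because of floating-point rounding.
    threshold = observed * (1 + 1e-9)
    total = 0.0
    for i in range(n + 1):
        prob = binomial_pmf(i, n, p)
        if prob <= threshold:
            total += prob
    return min(1.0, total)


# HYPOTHETICAL: 100 trials per condition is an assumption for illustration,
# not the study's sample size. Only the percentages come from the article.
N_TRIALS = 100

if __name__ == "__main__":
    conditions = [("GPT-4.5 + persona", 73), ("LLaMa-3.1-405B + persona", 56)]
    for label, judged_human in conditions:
        p_value = two_sided_binomial_p(judged_human, N_TRIALS)
        rate = judged_human / N_TRIALS
        print(f"{label}: {rate:.0%} judged human, p = {p_value:.2g}")
```

Under these assumed counts, 73 of 100 would differ clearly from chance, while 56 of 100 would not. That matches the article's description of LLaMa-3.1-405B as statistically indistinguishable from its human partners. With the real trial counts the p-values would differ.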

Reinterpreting the Turing Test

The authors argue this work requires reframing what the Turing test measures. Historically posed as a probe of machine intelligence, the test now functions as a gauge of “humanlikeness”: the stylistic, emotional, and socio-communicative traits that people use to identify others as human. The study shows large language models can be engineered to emulate those traits, even if they still differ from humans in other cognitive respects.

Coauthor Ben Bergen emphasized that the winning models did not prevail by demonstrating superior reasoning or cognitive depth. Instead, their success depended on mimicking the kinds of small errors and expressive choices that characterize ordinary human conversation—traits Turing did not explicitly foresee when he proposed his behavioral test.

Prompts determined outcomes

The experiments make clear that persona prompts were essential. With no persona guidance, GPT-4.5’s human-deception rate fell from 73% to 36%, and LLaMa-3.1-405B’s fell from 56% to 38%. Baseline systems such as ELIZA and GPT-4o remained low (roughly 23% and 21%, respectively). This pattern suggests that while LLMs possess the capacity to appear humanlike, they often need explicit instructions to surface those social behaviors.

Why this matters: trust and misuse risks

Passing a Turing-style test over five to 15 minutes has practical consequences. Interactions of that length are long enough for a conversational partner to build rapport, influence decisions, and gain trust. The study’s authors warn that convincingly humanlike bots could be misused for social engineering, fraud, political persuasion, or other forms of deception at scale.

The researchers urge greater public awareness of how easily these systems can be prompted to impersonate humans, and call for discussions about safeguards, detection tools, and policies to mitigate harms while preserving legitimate uses of conversational AI.

Study procedures

The team created an online messaging-style interface in which interrogators saw two split-screen conversations and knew one partner was a human and the other a machine. Sessions lasted five minutes in the main experiments and 15 minutes in a replication. Two independent subject pools, UC San Diego undergraduates and a Prolific-sourced online sample, let the team check that the results held across different groups of participants. UC San Diego participants achieved slightly higher discrimination overall, possibly because shared campus experiences offered additional probes.
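The procedure can be pictured as a simple record per session plus a tally per system. The following is a minimal sketch under stated assumptions, not the researchers' software: the `Trial` record, the `assign_trial` and `judged_human_rates` helpers, and the random choice of which pane shows the machine are all hypothetical. Only the four system names and the split-screen, pick-the-human format come from the article.

```python
import random
from collections import Counter
from dataclasses import dataclass

# The four systems named in the article.
SYSTEMS = ["ELIZA", "GPT-4o", "LLaMa-3.1-405B", "GPT-4.5"]


@dataclass
class Trial:
    """One three-party session (hypothetical record layout)."""
    system: str   # which system played the machine partner
    ai_side: str  # pane showing the machine: "left" or "right"
    verdict: str  # pane the interrogator picked as the human

    @property
    def ai_judged_human(self) -> bool:
        # The machine "wins" when the interrogator picks its pane.
        return self.verdict == self.ai_side


def assign_trial(rng: random.Random) -> tuple[str, str]:
    """Randomly choose the system and the pane it appears in
    (pane randomization is an assumption, not stated in the article)."""
    return rng.choice(SYSTEMS), rng.choice(["left", "right"])


def judged_human_rates(trials: list[Trial]) -> dict[str, float]:
    """Share of sessions, per system, in which the machine was picked as human."""
    totals: Counter = Counter()
    wins: Counter = Counter()
    for trial in trials:
        totals[trial.system] += 1
        wins[trial.system] += int(trial.ai_judged_human)
    return {system: wins[system] / totals[system] for system in totals}


if __name__ == "__main__":
    # Made-up verdicts purely to exercise the tally; not study data.
    demo = [
        Trial("GPT-4.5", "left", "left"),
        Trial("GPT-4.5", "right", "left"),
        Trial("ELIZA", "left", "right"),
    ]
    print(judged_human_rates(demo))
```

Because every session contains exactly one human and one machine, each verdict for the machine is also a verdict against the human. This is why the rates in the article are read against a 50% baseline.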

Key questions answered

Q: Why is passing a text-based Turing test still important when machines can solve complex problems?

A: The Turing test evaluates humanlikeness—how well a system can weave humor, flaws, empathy, and social nuance into casual conversation. That quality matters because it affects how people perceive and trust conversational partners, not just what factual knowledge the system can provide.

Q: How did a model get chosen as “more human” than a real person?

A: The models were prompted to display ordinary, imperfect human behavior, such as hesitation, casual humor, and minor mistakes, which interrogators interpreted as authentic. This engineered imperfection led GPT-4.5 to be picked as the human more often than the actual human interlocutors.

Q: What are the dangers of convincing AI in prolonged chats?

A: Over sustained conversations, an LLM that appears human can build trust and influence, enabling large-scale scams, manipulation of opinions, or the extraction of sensitive information if deployed maliciously. Awareness, detection, and safeguards are essential to reduce these risks.

Editorial notes

This article was edited by a Neuroscience News editor.
The full journal paper was reviewed by editorial staff.
Additional context added by the reporting team.

About this research news

Author: Christine Clark
Source: UCSD
Contact: Christine Clark – UCSD
Image credit: Neuroscience News

Original research: Open access. “Large Language Models Pass a Standard Three-Party Turing Test” by Cameron Jones and Ben Bergen. PNAS. DOI: 10.1073/pnas.2524472123

Abstract

Large Language Models Pass a Standard Three-Party Turing Test

The Turing test has been used both as a test of machine intelligence and as a measure of how humans distinguish other humans from machines. This study evaluated four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomized, controlled, preregistered Turing tests on independent populations. Participants held brief conversations with another human and with one of the systems before judging which partner they believed to be human. With persona prompts, GPT-4.5 was judged human 73% of the time and LLaMa-3.1-405B 56% of the time; without prompts these figures dropped substantially. A replication using 15-minute conversations produced similar patterns. The findings offer empirical evidence that contemporary systems can pass a standard three-party Turing test and highlight the importance of stylistic and socio-emotional cues in human judgments.