Summary: Can you reliably tell a real human voice from an AI-generated one? A recent study suggests that while people’s conscious judgments often fail, the brain itself begins to detect the difference after brief exposure.
Researchers report that listeners are generally poor at distinguishing “deepfake” speech from genuine human voices, and a short training session leads to only small improvements in behavioral performance. However, neural recordings tell a contrasting story: after brief perceptual training, the brain’s responses to AI-generated and human speech become measurably more distinct. That implies the auditory system adapts to subtle acoustic differences faster than conscious decision-making does.
Key Facts
- Behavioral limits: Participants struggled to reliably identify AI versus human voices. A short training period produced only modest improvement in their explicit judgments.
- Neural sensitivity: EEG measures revealed that training increased the neural separation between AI and human speech, indicating that auditory processing had started to “tag” the two types of voices differently.
- Micro-acoustic cues: The brain appears to pick up tiny differences in timing, rhythm and tone—subtle flaws in synthetic prosody—that do not immediately translate into conscious recognition.
- Adaptation phase: The findings suggest people are still learning which cues to use; neural signals are present even when behavioral performance remains weak.
- Applications: This neural-behavioral dissociation offers a foundation for developing targeted training tools and interventions to help people detect voice-cloning fraud and deepfake scams.
Source: SfN
In a collaboration between Tianjin University and the Chinese University of Hong Kong, a team led by Xiangbin Teng combined behavioral testing with brain recordings to examine whether people can tell AI-generated speech from human speech, and whether brief training improves that ability.
This research is published in the journal eNeuro.

Thirty volunteers listened to sentences spoken by human speakers and by AI-generated clones, judging each sample as either human or AI before and after a short, guided training period. Behaviorally, listeners performed poorly at the task and showed only slight gains after training.
Neural measures, however, showed a different pattern. After training, the researchers observed clearer distinctions in the brain’s responses to human versus AI speech. This indicates that the auditory system began encoding differences in the acoustic signal even when listeners could not yet reliably report those differences.
As Teng explains, the auditory cortex appears to detect fine-grained acoustic cues—small deviations in rhythm, pitch contour or timing introduced by synthetic speech—before those signals are translated into conscious decisions. In other words, the sensory system is adapting ahead of our conscious awareness, which suggests perceptual training could be designed to bridge that gap.
Key Questions Answered:
A: Modern text-to-speech models replicate human prosody—rhythm, intonation and emotional coloring—so the conscious impression of a voice can feel genuinely human. Yet beneath that surface, synthetic systems often leave subtle, consistent irregularities in the sound that the auditory system can detect as a different signature.
A: There is a distinction between perception and decision. Your sensory system may register small acoustic anomalies, but without the learned mapping from those signals to the conclusion “this is synthetic,” you won’t reliably report the voice as AI. Training can help form that mapping.
A: The study shows promise: although short training only slightly improved behavioral accuracy, it produced significant changes in neural responses. That suggests future training programs could teach people to attend to the exact cues their brains already detect, improving real-world detection of deepfake and voice-cloning fraud.
Editorial Notes:
- This article was edited by a Neuroscience News editor.
- The original journal paper was reviewed in full.
- Additional context was provided by the editorial staff.
About this AI and auditory neuroscience research news
Author: SfN Media
Contact: SfN Media – SfN
Original Research: Closed access.
Title: Short-Term Perceptual Training Modulates Neural Responses to Deepfake Speech but Does Not Improve Behavioral Discrimination — Jinghan Yang, Haoran Jiang, Yanru Bai, Guangjian Ni and Xiangbin Teng.
DOI: 10.1523/ENEURO.0300-25.2025
Abstract
Advances in artificial intelligence have made text-to-speech systems capable of producing voices that closely resemble human speakers, raising concerns about misuse for fraud and deception. To evaluate whether short-term perceptual training can improve detection of AI-generated speech, this study combined behavioral tests with electroencephalography (EEG) recordings.
Thirty participants listened to sentences produced by human speakers and matching AI-generated clones, judging each as human or AI both before and after a short (about 12-minute) training session that explicitly labeled examples as “human” or “AI.” Behaviorally, participants exhibited poor discrimination and showed only minor improvement after training. Neural analyses, however, revealed meaningful changes: temporal response function (TRF) analysis identified significant neural differentiation between speech types at early (~55 ms, ~210 ms) and later (~455 ms) auditory processing stages following training.
Additional EEG analyses, including spectral power and decoding approaches, provided further context but showed more limited differentiation. Overall, the results highlight a dissociation between behavioral and neural sensitivity: listeners find it difficult to consciously discriminate sophisticated AI voices, while the auditory system adapts rapidly to subtle acoustic differences after short exposure. Understanding this neural-behavioral gap is important for designing effective perceptual training protocols and for informing policies aimed at reducing the societal risks posed by realistic synthetic voices.
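For readers unfamiliar with temporal response function (TRF) analysis, the following is a minimal sketch of the general idea, not the study's actual pipeline. A TRF is commonly estimated by regressing the EEG signal on time-lagged copies of a stimulus feature, such as the speech envelope, using ridge regression. Everything here is an assumption made for illustration: the data are simulated, the "envelopes" are white noise, and the sampling rate, lag window, ridge value, and the 210 ms lag where the two simulated conditions differ are all hypothetical choices.

```python
import numpy as np

def lagged_design(stimulus, n_lags):
    """Build a matrix whose column k is the stimulus delayed by k samples."""
    n = len(stimulus)
    X = np.zeros((n, n_lags))
    for k in range(n_lags):
        X[k:, k] = stimulus[:n - k]
    return X

def estimate_trf(stimulus, eeg, n_lags, ridge=1.0):
    """Ridge regression solution: w = (X'X + ridge * I)^-1 X'y."""
    X = lagged_design(stimulus, n_lags)
    gram = X.T @ X + ridge * np.eye(n_lags)
    return np.linalg.solve(gram, X.T @ eeg)

# Hypothetical settings, chosen only for this demonstration.
rng = np.random.default_rng(0)
fs = 100          # sampling rate in Hz (assumed)
n_lags = 60       # lags from 0 to 590 ms at 100 Hz
n_samples = 6000  # one minute of simulated data per condition

# Simulated "true" responses: the two conditions differ only at lag 21 (210 ms).
human_trf = np.zeros(n_lags)
human_trf[5] = 1.0
human_trf[21] = -0.5
ai_trf = human_trf.copy()
ai_trf[21] = -0.8

def simulate_eeg(true_trf):
    """Return a white-noise envelope and a noisy single-channel response to it."""
    envelope = rng.standard_normal(n_samples)
    eeg = lagged_design(envelope, n_lags) @ true_trf
    eeg += 0.5 * rng.standard_normal(n_samples)
    return envelope, eeg

est_human = estimate_trf(*simulate_eeg(human_trf), n_lags)
est_ai = estimate_trf(*simulate_eeg(ai_trf), n_lags)

# The lag with the largest difference between the two estimated TRFs.
difference = est_ai - est_human
peak_lag_ms = 1000 * int(np.argmax(np.abs(difference))) / fs
print(f"Largest TRF difference at about {peak_lag_ms:.0f} ms")
```

In this toy setup the estimated TRFs diverge at the lag where the simulated responses were made to differ. A real analysis would use multichannel EEG, actual speech envelopes, cross-validated regularization, and statistical testing across participants, none of which is shown here.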