How AI Processes Speech Like the Human Brain

Summary: Researchers at the University of California, Berkeley, report that artificial intelligence systems can process auditory signals in ways that closely resemble how the human brain encodes speech. By recording brain activity with scalp electrodes while people listened to a single syllable and comparing those responses to signals from an unsupervised AI trained on English, the team found striking parallels in waveform shape and timing. These results offer a concrete step toward understanding how AI models transform sound into internal representations and may help guide development of more interpretable and reliable systems.

Key Facts:

  1. In a study published in Scientific Reports, researchers observed that signals produced by an AI system trained on English exhibited shapes and timing patterns highly similar to brain waves recorded while participants listened repeatedly to the syllable “bah.”
  2. The study used a system of electrodes on participants’ scalps to capture brainstem and cortical responses to thousands of repetitions of the same sound, then compared those neural traces directly with responses recorded from convolutional layers inside an unsupervised neural network.
  3. Directly comparing raw waveforms — without linear transformations or heavy preprocessing — revealed substantial parallels, suggesting that some aspects of acoustic encoding are shared between biological and artificial networks.
  4. Improving our understanding of what happens between input and output in AI models is essential as these systems are increasingly used across healthcare, education, and other sectors; such insights could help mitigate bias and reduce unexpected errors.

Source: UC Berkeley

New findings from the Berkeley Speech and Computation Lab show that an unsupervised neural network’s internal responses to speech mirror brain activity recorded from human listeners, offering an interpretable bridge between machine and human speech processing.

The researchers placed electrodes on participants’ heads to record electrical activity as volunteers listened to a single spoken syllable — “bah” — repeated thousands of times. They then fed the identical audio into an unsupervised convolutional neural network trained to represent English speech and recorded the network’s intermediate-layer activations to the same stimulus.

“The waveform shapes line up in a striking way,” said Gasper Begus, assistant professor of linguistics at UC Berkeley and lead author of the study published in Scientific Reports. “That similarity indicates that comparable acoustic features are being encoded by both systems.”

Unlike many prior comparisons between biological and artificial systems, this work compares averaged signals directly, without applying linear transformations between them. The approach is analogous to how electroencephalography (EEG) sums the activity of many neurons into a single trace over time. The team relied on a technique developed in the Berkeley lab that averages activity in the time domain, producing one time-domain trace for the brain response and one for the model response.
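The averaging idea on the brain side can be illustrated with a toy example. The sketch below is not the authors' pipeline: it assumes the common practice of averaging many repetitions of the same stimulus sample by sample, and the sine-wave "response", the noise level, the sampling rate, and the function names are all made up for illustration.

```python
import math
import random


def average_trials(trials):
    """Average equal-length trials sample by sample (time-domain averaging)."""
    n_trials = len(trials)
    n_samples = len(trials[0])
    return [sum(trial[i] for trial in trials) / n_trials for i in range(n_samples)]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length traces."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical "true" response: a 100 Hz sinusoid sampled at 10 kHz for 20 ms.
SAMPLE_RATE_HZ = 10_000
template = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE_HZ) for i in range(200)]

# Simulate 2,000 repetitions, each buried in noise far larger than the signal.
rng = random.Random(0)
trials = [[s + rng.gauss(0.0, 5.0) for s in template] for _ in range(2000)]

averaged = average_trials(trials)
single_trial_error = rms_error(trials[0], template)
averaged_error = rms_error(averaged, template)
print(f"single-trial RMS error: {single_trial_error:.2f}")
print(f"averaged RMS error:     {averaged_error:.2f}")
```

Because the noise is independent from one repetition to the next, it largely cancels in the average while the stimulus-locked waveform remains. That is one reason a single syllable is typically played thousands of times in this kind of recording.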

The researchers transmitted the same recording of the “bah” sound through an unsupervised neural network that could interpret sound. Credit: Neuroscience News

The study examined peak latency, the timing of key waveform peaks relative to the stimulus, in both the complex auditory brainstem response (cABR) and intermediate convolutional layers of the AI. Across eight trained networks, including a replication experiment, the researchers documented consistent similarities in how peak latencies related to the input sound, and they found that prior language exposure affected timing in both the biological and the artificial system.
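To make "peak latency" concrete, here is a minimal sketch of measuring it on an averaged trace. It is a deliberate simplification: it takes the single largest sample after stimulus onset, whereas real brainstem analyses usually identify specific peaks with more care. The function names, the toy Gaussian-shaped traces, and the 16 kHz sampling rate are hypothetical and not taken from the study.

```python
import math


def peak_latency_ms(trace, sample_rate_hz, stimulus_onset_index=0):
    """Return the time (ms) from stimulus onset to the largest peak in the trace."""
    post_onset = trace[stimulus_onset_index:]
    peak_offset = max(range(len(post_onset)), key=lambda i: post_onset[i])
    return 1000.0 * peak_offset / sample_rate_hz


def gaussian_bump(n_samples, center_index, width):
    """Toy waveform with one peak, used as a stand-in for a real response."""
    return [
        math.exp(-((i - center_index) ** 2) / (2 * width ** 2))
        for i in range(n_samples)
    ]


SAMPLE_RATE_HZ = 16_000  # assumed sampling rate for both toy traces

# Hypothetical traces: the "brain" peak sits at sample 160, the "model" peak at 128.
brain_trace = gaussian_bump(800, center_index=160, width=12)
model_trace = gaussian_bump(800, center_index=128, width=12)

brain_latency = peak_latency_ms(brain_trace, SAMPLE_RATE_HZ)
model_latency = peak_latency_ms(model_trace, SAMPLE_RATE_HZ)
print(f"brain peak latency: {brain_latency:.1f} ms")
print(f"model peak latency: {model_latency:.1f} ms")
print(f"difference:         {brain_latency - model_latency:.1f} ms")
```

Measuring both traces in milliseconds relative to the same stimulus onset is what allows the two latencies to be compared directly.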

Begus and his co-authors, Alan Zhou (Johns Hopkins University) and T. Christina Zhao (University of Washington), emphasize that understanding these shared encodings can help demystify what happens inside AI “black boxes.” As powerful language and audio models become more widely used, clarifying how they represent sound is crucial for interpretability, safety, and reducing unintended biases.

The team plans further work comparing signals obtained with other brain imaging techniques and exploring how different languages — for example, tonal languages such as Mandarin — shape both human and machine representations. Speech is a particularly tractable domain for these investigations because spoken languages use a relatively small set of phonetic elements, making it easier to analyze core encoding mechanisms than in modalities with far more granular variation, such as color or written text.

“If we want to understand how these models learn, we need to begin with simple, well-understood elements,” Begus said. “Speech gives us that starting point, and these direct waveform comparisons provide a measurable benchmark for how closely artificial architectures resemble human processing.”

About this artificial intelligence research news

Author: Jason Pohl
Source: UC Berkeley
Contact: Jason Pohl – UC Berkeley
Image: The image is credited to Neuroscience News

Original Research: Open access. “Encoding of speech in convolutional layers and the brain stem based on language experience” by Gasper Begus et al., Scientific Reports


Abstract

Encoding of speech in convolutional layers and the brain stem based on language experience

Recent work has compared artificial neural networks with the outputs of neuroimaging techniques in the vision and text domains. This study proposes a framework for comparing biological and artificial neural computations of spoken language by applying a time-domain averaging approach similar to EEG. The method enables direct comparison of how acoustic properties are encoded in the brain and in intermediate convolutional layers of deep neural networks, without applying linear transformations between the signals.
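On the network side, one way to picture time-domain averaging is to collapse a convolutional layer's many channels into a single trace. The sketch below assumes the average is taken across channels (feature maps) at each time step, which the text above does not spell out. The three filters are hand-written stand-ins for learned weights, and all names are hypothetical.

```python
import math


def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, the operation used in convolutional layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Set negative activations to zero."""
    return [max(0.0, v) for v in values]


def layer_average(feature_maps):
    """Average a layer's feature maps channel-wise, leaving one value per time step."""
    n_channels = len(feature_maps)
    return [sum(column) / n_channels for column in zip(*feature_maps)]


# Hypothetical input: a 20 ms, 100 Hz tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]

# Hypothetical filters standing in for learned weights.
kernels = [
    [0.25, 0.5, 0.25],  # smoothing
    [-1.0, 0.0, 1.0],   # responds to rising input
    [1.0, 0.0, -1.0],   # responds to falling input
]

feature_maps = [relu(conv1d(signal, kernel)) for kernel in kernels]
layer_trace = layer_average(feature_maps)
print(len(feature_maps), "channels ->", len(layer_trace), "time steps")
```

The result is a single time series per layer, the same kind of object as an averaged EEG trace, so the two can be compared directly without fitting a mapping from one to the other.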

Using this technique, the authors show that the complex auditory brainstem response (cABR) and the responses of intermediate convolutional layers to the identical stimulus are highly similar in waveform shape and timing. The approach quantifies peak latency relative to the stimulus in both systems and examines the influence of prior language exposure on latency encoding. Results from eight trained networks, including a replication, reveal substantial parallels between human and network responses. The technique is generalizable and can be used to compare the encoding of any acoustic property across neuroimaging modalities and deep convolutional layers.