Why Lip-Reading Errors Happen

Summary: Lip-reading is a demanding cognitive skill that requires the brain to interpret speech from visible mouth movements rather than from sound. While researchers have long measured overall lip-reading accuracy, the specific pattern of errors made by lip-readers has been less clear. Historically, many studies assessed errors through an auditory lens—focusing on phonemes (spoken sounds)—instead of examining the visual building blocks of speech.

A recent study from the University of Kansas takes a distinctly visual approach. Using methods from network science, researchers created a vast visual map of roughly 20,000 English words. Instead of organizing entries by sound, the map is built around visemes—the observable shapes and gestures of the lips, jaw, and mouth that correspond to speech.

This viseme-based network reveals a complex visual landscape of language. By analyzing words by how they look in articulation, the team uncovered predictable, structure-driven patterns in lip-reading errors: where a word sits within this visual network largely determines which mistakes people make.

Key Facts

The viseme perspective: Visemes are the visual counterparts to phonemes; this study prioritized visual cues—lip, jaw, and mouth shapes—over auditory signals to map words according to how they appear when spoken.
Look-alike bottleneck: The visual map shows that about one-third of English words have at least one visual twin (another word that looks the same on the mouth), creating persistent perceptual competitors for lip-readers.
Compression and stretch: The visual word space is uneven: some regions are tightly packed with visually similar words while others are more spread out. High-density areas create many look-alike competitors and reduce lip-reading accuracy.
Predictable error paths: Mistakes are not random. When faced with ambiguity, a lip-reader tends to report a visually similar but more commonly used word from the same compressed region of the network rather than the intended word.
Small misses: Most errors are narrowly wrong—typically off by just one or two visemes—meaning lip-readers often come very close to the intended word.
Practical applications: The research team is converting these visual maps into clinical training tools to help people who are hearing-impaired reduce the gap between their guesses and intended words. The maps also offer a way to train multimodal AI systems to combine facial feature tracking with audio for more human-like transcription accuracy.

Source: University of Kansas

New research from the University of Kansas uses network science to explain why people make errors while lip-reading.

Michael Vitevitch, professor of speech-language-hearing at the University of Kansas, and his co-authors built a visual representation of approximately 20,000 English words to understand why some words are harder to lip-read than others.

Their findings, published in the Journal of the Acoustical Society of America, shed light on how visual structure influences lip-reading performance and suggest ways to improve training for lip-readers and to enhance automated transcription systems that could combine audio and visual information.

“We wanted to know how people read lips, how accurate they are, and—more importantly—what kinds of mistakes they make,” Vitevitch said. “Many earlier studies reported overall accuracy but didn’t look closely at the characteristics of the errors. There’s a lot to learn from the mistakes themselves, and that was our focus.”

Previous research often framed lip-reading errors in terms of phonemes—the smallest units of sound. Vitevitch’s team took a different tack by concentrating on visemes, the observable gestures of articulation.

“We examined the visual characteristics,” he explained. “Instead of counting how many sounds a person identified correctly, we counted how many visual elements—the visemes—they got right. We based our analysis entirely on what the eyes can see: the movements and configurations of the lips, jaw, and mouth, without relying on auditory information.”

Some words both sound and look similar (for example, kit, cat, cut), while others sound quite different but appear similar on the mouth (for example, vet, fit, fuzz). If you rely only on facial information, many of those distinctions disappear.
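
The collapse from sounds to mouth shapes can be sketched in a few lines of Python. This is only an illustration: the phoneme-to-viseme table, the viseme class names, and the simplified transcriptions below are assumptions made for this example, not the viseme set used in the study.

```python
from collections import defaultdict

# Hypothetical, deliberately coarse phoneme-to-viseme table.
# It is an assumption for illustration, NOT the study's viseme inventory.
PHONEME_TO_VISEME = {
    "f": "LABIODENTAL", "v": "LABIODENTAL",
    "t": "ALVEOLAR", "z": "ALVEOLAR",
    "k": "VELAR",
    "IH": "VOWEL", "EH": "VOWEL", "AE": "VOWEL", "AH": "VOWEL",
}

# Simplified transcriptions of the article's example words (assumed).
WORDS = {
    "kit": ["k", "IH", "t"],
    "cat": ["k", "AE", "t"],
    "cut": ["k", "AH", "t"],
    "vet": ["v", "EH", "t"],
    "fit": ["f", "IH", "t"],
    "fuzz": ["f", "AH", "z"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its sequence of viseme classes."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def viseme_twins(words):
    """Group words that share an identical viseme sequence."""
    groups = defaultdict(list)
    for word, phonemes in words.items():
        groups[to_visemes(phonemes)].append(word)
    # Keep only sequences shared by two or more words.
    return {seq: sorted(ws) for seq, ws in groups.items() if len(ws) > 1}


twins = viseme_twins(WORDS)
```

Under this toy table, vet, fit, and fuzz end up with one shared viseme sequence, and kit, cat, and cut with another, even though each word has its own phoneme sequence.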

From their visual map, the researchers concluded:

People tend to confuse a target word with a more common word from the same visually similar group.
About one-third of English words visually match at least one other word when spoken.
Words with many visual look-alikes are consistently harder to lip-read accurately.
Errors cluster in regions of the network where visually similar words are grouped together; they are not randomly distributed.

“One surprise was how challenging this task is,” Vitevitch said. “People often overestimate their ability. Most errors are only one or two visemes off, so lip-readers frequently capture much of the visual information but still miss enough to confuse the word.”

The visual map helped the team observe how words distribute across the viseme landscape: words that look alike lie close together, while visually distinct words are spaced farther apart.

“The landscape stretches and compresses in ways we didn’t expect,” Vitevitch said. “Some areas become very crowded, increasing the number of competitors and making accurate perception more difficult. Other areas are more isolated, which makes words easier to distinguish visually.”

The KU team plans to use these maps to inform training programs.

“If we track a person’s errors over time, those errors should move closer to the target word,” Vitevitch said. “With practice, people can pick up more visual cues and make more accurate identifications.”

Another promising use is enhancing automatic transcription systems.

“Current platforms do a reasonable job transcribing speech from audio alone, but audio can fail in noisy environments or with poor microphones,” Vitevitch noted. “Adding visual information from a speaker’s face could help machines resolve ambiguities and produce more reliable, human-like transcriptions.”

Vitevitch said the group will continue refining this work, exploring machine-learning applications and practical tools to assist people who need help understanding speech visually.

Co-authors on the study include KU graduate students Maia Flynn and Reid Kelly, and Lorin Lachs of California State University, Fresno.

Key Questions Answered:

Q: What is a “viseme,” and why is it more important for lip-reading than a phoneme?

A: A phoneme is the smallest unit of sound and is central to auditory speech research. For lip-reading, however, the eyes do the work. A viseme is the visual equivalent of a phoneme: it captures how a particular sound looks on the face—the shapes and movements of the lips, jaw, and tongue. For example, words such as “vet,” “fit,” and “fuzz” contain different phonemes but can appear visually identical on the mouth, making them difficult to tell apart by sight alone.

Q: What does it mean that the visual map of English words “stretches and compresses”?

A: Using network analysis, the researchers placed words close together when they look alike and farther apart when they look different. The result is an uneven visual topology: some regions are compressed—many words share similar mouth shapes—creating dense clusters of look-alikes. Other regions are stretched, where words are visually distinct and easier to identify. Compression leads to more perceptual competitors and increases the chance of error.
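
As a rough sketch of how "compressed" and "stretched" regions could be quantified, the Python below links words whose viseme sequences are identical or differ by a single viseme, then counts each word's links. The lexicon, word names, and viseme labels are placeholders invented for this example, and treating twins and one-edit neighbors alike as links is an assumption rather than the study's construction.

```python
from itertools import combinations


def one_edit_apart(a, b):
    """True if b results from adding, deleting, or substituting one viseme in a."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Dropping exactly one element of the longer sequence must give the shorter.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def build_network(lexicon):
    """Link every pair of words that are visual twins or one-edit neighbors."""
    links = {word: set() for word in lexicon}
    for w1, w2 in combinations(lexicon, 2):
        s1, s2 = lexicon[w1], lexicon[w2]
        if s1 == s2 or one_edit_apart(s1, s2):
            links[w1].add(w2)
            links[w2].add(w1)
    return links


# Placeholder lexicon: hypothetical words with abstract viseme labels.
TOY_LEXICON = {
    "w1": ("A", "B", "C"),
    "w2": ("A", "B", "C"),       # twin of w1
    "w3": ("A", "B", "D"),       # one substitution away from w1
    "w4": ("A", "B"),            # one deletion away from w1
    "w5": ("X", "Y", "Z", "Q"),  # visually isolated
}

network = build_network(TOY_LEXICON)
competitors = {word: len(linked) for word, linked in network.items()}
```

In this toy network, w1 through w4 each have three competitors and form a crowded cluster, while w5 has none and sits in a sparse region.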

Q: How can this visual word map be used to improve artificial intelligence and daily video calls?

A: Many transcription systems currently rely primarily on audio, which can fail in noisy or low-quality conditions. Integrating a visual word map into machine-learning models would allow algorithms to analyze a speaker’s face in real time alongside the audio stream. Combining what the system hears with what it sees can reduce ambiguity and improve transcription quality, particularly in challenging environments.

Editorial Notes:

This article was edited by a Neuroscience News editor.
The journal paper was reviewed in full.
Additional context was provided by staff.

About this visual neuroscience research news

Author: Brendan Lynch
Source: University of Kansas
Contact: Brendan Lynch – University of Kansas
Image: The image is credited to Neuroscience News

Original Research: Open access. “The visome: Using cognitive networks to examine lip-reading errors in English words” by Michael S. Vitevitch, Lorin Lachs, Maia B. Flynn, and Reid Kelly. Journal of the Acoustical Society of America. DOI: 10.1121/10.0044182

Abstract

The visome: Using cognitive networks to examine lip-reading errors in English words

This study applies network science to examine how English words appear visually when spoken rather than how they sound. The researchers built a visome—a network of visual word representations—and compared its structure to a phonological network at three levels: macro (whole network), meso (subsets of nodes), and micro (individual nodes). The goal was to determine how visome structure affects lip-reading performance.

Conventional psycholinguistic measures and network metrics were analyzed alongside two databases of lip-reading errors. Results showed that the erroneous responses in these datasets tended to be words of higher frequency than the intended target words.

Target words frequently had uniqueness points that emerged after the end of the word, indicating many words are visually embedded within others in the visome. Words vary in the number of viseme twins they possess—words that look identical when spoken—and those with many twins are lip-read less accurately.
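
One way to make the uniqueness-point idea concrete is sketched below in Python. It uses a common working definition (the first position at which a word's viseme sequence no longer matches the start of any other word); the abstract does not spell out the exact definition used, so this is an assumption, as are the hypothetical entries and viseme labels in the toy lexicon.

```python
def uniqueness_point(target, lexicon):
    """Return the 1-based viseme position at which `target` diverges from
    every other word, or None if it never diverges within its own length
    (its uniqueness point falls after the end of the word)."""
    seq = lexicon[target]
    others = [s for word, s in lexicon.items() if word != target]
    for i in range(1, len(seq) + 1):
        prefix = seq[:i]
        # Unique once no other word begins with the same i visemes.
        if not any(other[:i] == prefix for other in others):
            return i
    return None


# Hypothetical entries with abstract viseme labels (assumed for illustration).
TOY_LEXICON = {
    "short": ("K", "V", "T"),
    "longer": ("K", "V", "T", "L"),  # begins with the same visemes as "short"
    "other": ("F", "V", "T"),
}

up_short = uniqueness_point("short", TOY_LEXICON)
up_longer = uniqueness_point("longer", TOY_LEXICON)
up_other = uniqueness_point("other", TOY_LEXICON)
```

Here "short" is visually embedded in "longer", so the function returns None for it: nothing seen by the end of the word rules out the longer competitor. "longer" becomes unique at its fourth viseme, and "other" at its first.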

Similarly, words with many viseme neighbors—ones related by addition, deletion, or substitution of a viseme—are also harder to identify visually. Errors tended to fall within the same community as the target word rather than in a different community. The authors conclude that network analysis offers valuable tools for advancing research on lip-reading and for developing applications to assist people who rely on visual speech cues.
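
The neighbor relation described above (one viseme added, deleted, or substituted) corresponds to an edit distance of 1 between viseme sequences. The Python sketch below computes a standard Levenshtein distance over sequences; the viseme labels are placeholders, and using this particular distance to score how far an error lies from its target is an assumption made for illustration, not the authors' stated scoring procedure.

```python
def viseme_distance(a, b):
    """Levenshtein distance between two viseme sequences."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, va in enumerate(a, start=1):
        curr = [i]
        for j, vb in enumerate(b, start=1):
            cost = 0 if va == vb else 1
            curr.append(min(
                prev[j] + 1,         # delete va
                curr[j - 1] + 1,     # insert vb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def is_viseme_neighbor(a, b):
    """Neighbors differ by exactly one added, deleted, or substituted viseme."""
    return viseme_distance(a, b) == 1


# Placeholder viseme sequences (labels are hypothetical).
target = ("F", "V", "T")
twin = ("F", "V", "T")
substituted = ("K", "V", "T")
shortened = ("F", "V")
far_off = ("K", "V", "L", "S")

distances = [viseme_distance(target, s)
             for s in (twin, substituted, shortened, far_off)]
```

A distance of 0 marks a viseme twin, 1 a viseme neighbor, and larger values mark responses that are visually farther from the target.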