System learns to distinguish words’ phonetic components without human annotation of training data

Every spoken language is built from a finite set of phonemes — the smallest units of sound that combine to form words. English, for example, contains roughly 35–45 phonemes depending on how they are counted. Recognizing those phonetic units makes automated speech processing and recognition much more effective. A team of researchers at MIT has developed an unsupervised machine-learning system that not only learns to identify spoken words from raw audio but also discovers lower-level phonetic units such as syllables and phonemes without any annotated training data.

The research, published in 2015 in the Transactions of the Association for Computational Linguistics, addresses a major bottleneck in speech technology: the need for large amounts of hand-labeled audio. Because the system operates directly on unlabeled audio files, it can be applied more easily to new data sets and to languages that lack extensive linguistic resources. That makes it particularly promising for building speech-processing tools for under-resourced languages and for improving the portability of speech systems across diverse speaker populations.

Chia-ying Lee, who completed her PhD in computer science and engineering at MIT and is first author on the paper, explains the cognitive motivation behind the approach: “When children learn a language, they don’t learn how to write first; they learn from speech. By detecting statistical patterns in raw audio, they infer the structure of language. Our model attempts a similar discovery process from acoustic input.” Co-authors include Jim Glass, senior research scientist and head of the Spoken Language Systems Group at MIT’s Computer Science and Artificial Intelligence Laboratory, and Timothy O’Donnell, postdoctoral researcher in the MIT Department of Brain and Cognitive Sciences.

Modeling word frequencies and phonetic variability

Because the model is unsupervised and receives no labeled examples, it relies on a few reasonable assumptions about the structure of spoken language to guide learning. One assumption concerns word frequency: words in natural language tend to follow a power-law distribution, meaning a handful of words occur very frequently while most words appear rarely — the familiar long tail of language statistics. The model does not require exact parameters for that distribution; it only exploits the general pattern.
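To make the long-tail pattern concrete, here is a minimal Python sketch of an idealized power-law (Zipf) frequency curve. It is not the researchers' model; the vocabulary size, the exponent of 1.0, and the function names are assumptions chosen purely for illustration.

```python
# Illustrative only: an idealized Zipf (power-law) word-frequency curve.
# vocab_size, exponent, and both function names are assumptions for this sketch.

def zipf_probabilities(vocab_size, exponent=1.0):
    """Return probabilities proportional to 1 / rank**exponent for ranks 1..vocab_size."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def coverage_of_top(probs, k):
    """Share of all word tokens accounted for by the k most frequent word types."""
    return sum(probs[:k])


probs = zipf_probabilities(vocab_size=10_000)
top_100_share = coverage_of_top(probs, 100)
print(f"Top 100 of 10,000 word types cover {top_100_share:.0%} of tokens")
```

Under these assumed settings, the 100 most frequent word types account for roughly half of all word tokens, while the remaining 9,900 types share the rest. That is the general shape the model is said to exploit without needing exact parameters.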

The most important component of the system is a probabilistic treatment of phonetic variability, which the authors describe as a “noisy-channel” model. In real speech, a single phoneme can be realized by a variety of acoustic forms depending on context, speaker, and position within a word. For example, the sound associated with the letter “t” often changes between word-initial and word-final positions, and between casual and careful speech. The researchers treat an audio sequence as if it were generated by a sequence of idealized phonemes that then pass through a noisy transmission channel that distorts them. The learning task is to infer the statistical mapping between distorted acoustic observations and the underlying phoneme categories. A particular observed sound might map to the phoneme /t/ with, say, 85 percent probability and to /d/ with 15 percent probability; the model learns those associations from data.
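The inference step in a noisy-channel setup can be sketched with Bayes' rule. The Python example below is a toy, not the authors' implementation: the sound label "sound_A", the prior, and the channel probabilities are all hypothetical numbers, picked so the result matches the 85/15 split mentioned above. In the actual system these quantities are learned from audio rather than written by hand.

```python
# Toy noisy-channel inference with Bayes' rule.
# All labels and probabilities below are hypothetical, chosen for illustration.

def posterior(observation, prior, channel):
    """Return P(phoneme | observation) for every phoneme in `prior`.

    prior:   dict mapping phoneme -> P(phoneme)
    channel: dict mapping phoneme -> dict of P(observed sound | phoneme)
    """
    joint = {
        phoneme: prior[phoneme] * channel[phoneme].get(observation, 0.0)
        for phoneme in prior
    }
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError(f"observation {observation!r} has zero probability")
    return {phoneme: p / total for phoneme, p in joint.items()}


# Assumed prior: /t/ and /d/ are equally likely before hearing anything.
prior = {"t": 0.5, "d": 0.5}

# Assumed channel: how often each intended phoneme surfaces as each sound.
channel = {
    "t": {"sound_A": 0.34, "sound_B": 0.66},
    "d": {"sound_A": 0.06, "sound_B": 0.94},
}

result = posterior("sound_A", prior, channel)
print({phoneme: round(p, 2) for phoneme, p in result.items()})
```

With these made-up numbers, hearing "sound_A" yields a posterior of about 0.85 for /t/ and 0.15 for /d/. The point of the sketch is the direction of inference: the channel describes how phonemes get distorted into sounds, and Bayes' rule runs it backward from sound to phoneme.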

The system relies on a “noisy-channel” model of phonetic variability: a single phoneme can produce a range of acoustic realizations across different speakers and contexts. Credit: Jose-Luis Olivares/MIT.

In controlled comparisons, the team found that incorporating phonetic variability into the model yields a large improvement over versions that do not model such variation. That demonstrates the importance of explicitly representing the uncertainty and variability that arise in real speech.

Evaluation on lecture recordings

To evaluate the system, the researchers ran experiments on six lecture recordings from MIT. The model reliably discovered and identified the most frequent words used in each lecture. As with any unsupervised system that infers structure from statistics, it also produced some understandable mistakes: in one lecture by New York Times columnist Thomas Friedman, the model treated the repeated phrase “open university” as a single lexical item. Because the two-word phrase occurred frequently in that speaker’s audio while its constituents appeared less often in isolation, the model had no reason to split it into two separate words.

Lee notes that such errors are natural consequences of learning purely from distributional patterns in acoustic input. When a sequence of sounds repeatedly appears together and rarely appears separately, the system will favor treating that sequence as a single unit. With more diverse data that shows the components in different contexts, the model can learn to segment compounds into distinct words.
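The intuition can be illustrated with a small Python sketch that works on written tokens rather than audio. It is a simplified stand-in, not the model described in the paper: the `pair_cohesion` score and the example sentences are assumptions invented for this illustration.

```python
from collections import Counter

# Toy illustration on text tokens; the real system works on acoustic input.
# pair_cohesion and the example sentences are hypothetical.

def pair_cohesion(tokens, first, second):
    """Fraction of the rarer word's occurrences that fall inside the pair.

    A value of 1.0 means the rarer word never appears outside the phrase,
    so the data gives no evidence that the phrase splits into two words.
    """
    unigram_counts = Counter(tokens)
    pair_count = sum(
        1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second)
    )
    rarer = min(unigram_counts[first], unigram_counts[second])
    return pair_count / rarer if rarer else 0.0


narrow = "the open university is new and the open university is growing".split()
broad = narrow + "an open door faces a university campus".split()

narrow_score = pair_cohesion(narrow, "open", "university")
broad_score = pair_cohesion(broad, "open", "university")
print(narrow_score, broad_score)
```

In the narrow sample the two words only ever occur together, so the score is at its maximum. Once the broader sample shows each word in another context, the score drops, which mirrors how more diverse data gives a learner grounds to split the phrase.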

The work has drawn outside praise for its scale and ambition. Emmanuel Dupoux, director of the Laboratory of Cognitive and Psycholinguistic Sciences in Paris, observes that recent experimental evidence suggests infants learn phonemes and words at the same time. He notes that prior computational studies either addressed only one side of this interaction or worked with small, simplified “toy” problems. By contrast, the MIT study tackles the full interaction between phoneme discovery and lexicon discovery at scale using a sizable speech corpus, a technically demanding achievement.

Implications for speech technology and language acquisition

The approach has two practical benefits for speech technology. First, because it is unsupervised, it reduces the need for expert annotation and can be applied more quickly to new languages and data sets. Second, by learning lower-level phonetic units and their probabilistic realizations, the model can better normalize variation across different speakers and speaking styles, improving robustness and portability of downstream speech-processing systems.

From a cognitive perspective, the model provides a computational hypothesis for how learners might extract both phonemes and words from raw auditory input, using statistical regularities and a probabilistic representation of acoustic variability to infer linguistic structure without explicit supervision.

Abstract

Unsupervised Lexicon Discovery from Acoustic Input

This work presents a model for unsupervised phonological lexicon discovery, addressing the simultaneous learning of phoneme-like and word-like units directly from acoustic input. The model extends earlier unsupervised approaches to phone-like unit discovery and symbolic lexicon discovery by integrating them through a probabilistic model of phonological variation. Experiments show the model is competitive with state-of-the-art spoken term discovery systems, and analyses illuminate the linguistic structures the model acquires.

Source: Larry Hardesty, MIT

Image Credit: Jose-Luis Olivares/MIT

Original Research: “Unsupervised Lexicon Discovery from Acoustic Input” by Chia-ying Lee, Timothy J. O’Donnell, and James Glass, Transactions of the Association for Computational Linguistics, published online September 2015. doi: Unavailable