Flexible Machine Learning Boosts Image Classification Accuracy

Giving machine-learning systems “partial credit” during training improves image classification.

Machine learning, the foundation of many modern artificial-intelligence applications, operates on probabilistic principles. For example, an image-recognition model might infer that a picture contains a dog with 60 percent probability and a cat with 30 percent probability. Traditional training approaches treat every incorrect label the same, but researchers at MIT have developed a training strategy that rewards models for producing labels that are semantically close to the ground truth.

At the Annual Conference on Neural Information Processing Systems in December, MIT researchers will present a method that encourages semantically related categories to reinforce one another during training. Under this approach, the algorithm learns that the co-occurrence of labels like “dog” and “Chihuahua” is more meaningful than the co-occurrence of “dog” and “cat,” and it uses those relationships to guide learning.

In tests on Flickr data, the MIT team found that models trained with this strategy predicted the tags users applied to images more accurately than models trained with conventional methods. More importantly, the models produced tags that were thematically closer to human annotations, a useful property for search and organization of large image collections.

“When you have many possible categories, the standard approach trains a separate model for each category using only examples labeled with that category,” says Chiyuan Zhang, an MIT graduate student in electrical engineering and computer science and a lead author on the paper. “That treats all other categories equally unfavorably, even though many categories are semantically similar. We developed a way to borrow information from related categories to improve training.”

Zhang’s co-authors include his thesis advisor Tomaso Poggio, the Eugene McDermott Professor in the Brain Sciences and Human Behavior; Charlie Frogner, a fellow graduate student and co-first author; Hossein Mobahi, a postdoctoral researcher at CSAIL; and Mauricio Araya-Polo, a researcher with Shell.

Close counts

To capture semantic similarity, the team measured how often tags co-occur in Flickr images. Tags that commonly appear together—such as “sunshine,” “water,” and “reflection”—are treated as semantically related. Instead of scoring a model simply as right or wrong for each tag, their training scheme assigns partial credit when a model predicts labels that are close in meaning to the ground-truth tags.

For example, consider an image tagged by users with “water,” “boat,” and “sunshine.” A conventional model that predicts “water,” “boat,” and “summer” would receive no more credit than one that predicts “water,” “boat,” and “rhinoceros,” even though “summer” is clearly more related to “sunshine.” The MIT approach gives the model partial credit for predicting “summer” because it frequently co-occurs with “sunshine” in the dataset; the amount of credit is proportional to the measured semantic similarity.

Image of a tower surrounded by words such as sky and roof.
Flickr users tagged a photograph similar to this one “architecture,” “tourism,” and “travel.” A machine-learning system that used a novel training strategy developed at MIT proposed “sky,” “roof,” and “building”; when it used a conventional training strategy, it came up with “art,” “sky,” and “beach.” Credit: MIT News.

Assigning this partial credit requires more sophisticated evaluation than a binary right-or-wrong score. To compare a model’s predicted tag distribution with the actual tags, the researchers used the Wasserstein distance, a metric for comparing probability distributions that takes into account how far apart labels are in semantic space. Computing the Wasserstein distance directly was once prohibitively slow for large datasets, but recent algorithmic advances made it practical for supervised learning tasks.

The team believes their work is among the first to use the Wasserstein distance as an error metric in supervised learning where model outputs are judged against human annotations. Their implementation adapts the metric to work with real-world data, which often comes as unnormalized tag histograms rather than neatly normalized probability distributions.

Human error

The MIT approach outperformed standard training methods on several evaluation criteria. It was better at reproducing the exact tags users applied on Flickr, and it was markedly better at producing tags that were semantically similar to the human annotations. That distinction matters: for image search and browsing, a broadly correct, thematically aligned tag set can be more helpful than an exact but brittle keyword match.

Human-generated tags are frequently noisy and inconsistent. One example in the researchers’ dataset showed a uniformed mountain biker on a hilly trail. User tags included “spring,” “race,” and “training,” though the scene looked like late season with bare trees and brown grass—tags that conflicted with one another. The MIT-trained model produced labels like “road,” “bike,” and “trail,” which better described the visual content. By contrast, a conventionally trained model output “dog,” “surf,” and “bike.”

The training framework is flexible: the semantic similarity measure could be replaced by any alternative that better captures human intuition, such as curated ontologies that encode hierarchical relationships between categories (for example, dog → collie → Lassie). The researchers plan to explore standard vision ontologies in future work to further refine the approach.

Marco Cuturi, whose algorithmic work helped make efficient Wasserstein computations possible, praised the research for applying the distance directly to learning machines and for addressing practical issues like unnormalized histograms. “They proposed a very elegant solution that is well motivated and computationally efficient,” he said.

About this computational neuroscience research

Source: Larry Hardesty – MIT
Image Source: The image is credited to MIT News
Original Research: The researchers will present their findings at the Annual Conference on Neural Information Processing Systems in Montreal, Canada. The conference ran between Monday December 07 – Saturday December 12, 2015.

Feel free to share this neuroscience article.