Summary: The game of “fetch” is simple for humans but exposes a hard problem for robots: identifying a specific object in a cluttered space. Researchers at Brown University found inspiration in an unexpected place: dogs. By studying how dogs use human gaze and pointing, the team developed a multimodal AI framework called LEGS-POMDP that combines spoken language with physical gestures to guide robot search under uncertainty.
In laboratory trials, the system achieved an 89% success rate in locating the correct object in challenging scenes, far outperforming approaches that rely on language or vision alone. The method pairs a vision-language model (VLM) with a probabilistic planning model so robots can reason about ambiguous signals and take actions that reduce uncertainty, such as repositioning to get a clearer view.
Key Facts
- Multimodal reasoning: The robot interprets both verbal commands and human gestures. Instead of treating pointing as a single line, the system constructs a probabilistic “cone” based on the person’s eye, elbow, and wrist alignment to estimate where the target likely lies.
- POMDP planning: The core planning engine is a Partially Observable Markov Decision Process (POMDP). It models uncertainty explicitly and chooses sensing actions, like moving or changing viewpoint, when observations are ambiguous.
- Canine inspiration: Insights from the Brown Dog Lab — which studies how dogs read human gaze and pointing — informed the gesture model that converts noisy human signals into a usable probability distribution.
- Performance boost: Combining gesture and language with VLM perception raised the success rate for locating the correct object to 89% in lab experiments, demonstrating that “showing” complements “telling” in human-robot interaction.
- Vision-language integration: A VLM provides scene-level understanding and links natural language descriptions to visual observations, enabling richer, more flexible commands than rigid, preprogrammed labels.
Source: Brown University
Overview
Robots that can fetch items reliably in everyday environments would be useful in homes, workshops, and healthcare settings. The Brown University team focused on improving how robots interpret what people mean when they combine words with gestures — for example, when someone says “that book” while pointing in a cluttered room. The researchers will present their findings on Tuesday, March 17 at the International Conference on Human-Robot Interaction in Edinburgh.

“Searching for objects requires a robot to navigate large, cluttered spaces,” said Ivy He, a Brown graduate student and lead author. “Robotic perception works well when objects are clear and unobstructed, but real environments contain occlusions, lookalike items, and repeated objects. This work shows how combining language and gesture improves a robot’s ability to find the correct target.”
Real-world perception is noisy: cameras and classifiers are uncertain, multiple similar items may be present, and some objects are partially hidden. The POMDP framework the team uses makes those uncertainties explicit. Rather than making a single blind choice, the robot maintains probabilistic beliefs about what it sees and selects actions that either gather more information or lead to a confident decision.
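The idea of maintaining probabilistic beliefs can be illustrated with a short Python sketch of a single Bayesian belief update. This is a minimal illustration, not the team's code: the object names, the uniform prior, and the likelihood values are all hypothetical.

```python
def update_belief(belief, likelihoods):
    """Bayes rule: posterior is proportional to prior times observation likelihood."""
    unnormalized = {obj: belief[obj] * likelihoods[obj] for obj in belief}
    total = sum(unnormalized.values())
    if total == 0:
        # The observation ruled out every candidate; keep the prior unchanged.
        return dict(belief)
    return {obj: p / total for obj, p in unnormalized.items()}


# Hypothetical scene: three lookalike books, no initial preference.
prior = {"book_a": 1 / 3, "book_b": 1 / 3, "book_c": 1 / 3}

# Hypothetical detector output: P(observation | target is this object).
likelihoods = {"book_a": 0.6, "book_b": 0.3, "book_c": 0.1}

posterior = update_belief(prior, likelihoods)
print(posterior)
```

With a uniform prior, the posterior simply mirrors the likelihoods (roughly 0.6, 0.3, and 0.1). A second, independent observation would be folded in by calling `update_belief` again with the posterior as the new prior.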
To model gestures, He drew on experimental work from the Brown Dog Lab led by Associate Professor Daphna Buchsbaum. That lab has documented how dogs excel at reading human body language — particularly gaze and pointing — to solve cooperative tasks. Using human subject data on pointing and gaze, the team modeled a pointing target as a cone-shaped probability distribution anchored on a line through the eye, elbow, and wrist. This produces a realistic estimate of where a person’s gesture is aimed.
“Humans naturally align their eye gaze with their pointing gesture,” He explained. “A cone defined by eye-to-elbow-to-wrist orientation is a simple, effective way to capture the spatial uncertainty of human pointing and to translate it into a signal a robot can use.”
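The cone idea can be sketched in a few lines of Python. This is an illustrative simplification rather than the researchers' implementation: it defines the pointing axis from the eye and wrist positions only, and the Gaussian falloff, the 15-degree spread (`sigma_deg`), and all coordinates are assumptions made for the example.

```python
import math


def angle_from_axis(origin, through, target):
    """Angle in radians between the ray origin->through and the vector origin->target."""
    axis = [t - o for o, t in zip(origin, through)]
    to_target = [t - o for o, t in zip(origin, target)]
    norm = math.hypot(*axis) * math.hypot(*to_target)
    if norm == 0:
        raise ValueError("points must be distinct from the origin")
    cosine = sum(a * v for a, v in zip(axis, to_target)) / norm
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))


def cone_weight(eye, wrist, target, sigma_deg=15.0):
    """Unnormalized weight: 1.0 on the pointing axis, decaying with angular offset."""
    angle = angle_from_axis(eye, wrist, target)
    return math.exp(-0.5 * (angle / math.radians(sigma_deg)) ** 2)


# Hypothetical 3D positions in meters.
eye = (0.0, 0.0, 1.6)
wrist = (0.5, 0.0, 1.2)
objects = {
    "on_axis": (2.0, 0.0, 0.0),   # lies on the eye-to-wrist ray
    "off_axis": (2.0, 1.5, 0.0),  # about 30 degrees away from the ray
}

weights = {name: cone_weight(eye, wrist, pos) for name, pos in objects.items()}
total = sum(weights.values())
distribution = {name: w / total for name, w in weights.items()}
print(distribution)
```

Normalizing the weights turns the cone into a probability distribution over candidate objects, which is the kind of signal a planner can combine with other evidence.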
Buchsbaum added that canine behavior offers a useful blueprint: “Dogs are extraordinarily sophisticated communicators with humans. Translating their intuitive use of gaze and gesture into a probabilistic model helps robots handle the ambiguity in human signals and supports more intuitive cooperation.”
The gesture cone was integrated with a vision-language model that scores candidate objects based on visual evidence and how well they match spoken descriptions. Together, these inputs feed the LEGS-POMDP planner, which balances exploration (move to obtain better observations) with exploitation (select an object to retrieve) under uncertainty.
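A rough Python sketch shows why fusing the two signals helps. It is not the LEGS-POMDP planner: a full POMDP solver reasons over sequences of future actions, whereas this example substitutes a simple confidence-threshold rule. The scores, the object names, and the 0.8 threshold are hypothetical.

```python
def fuse(gesture, language):
    """Multiply per-object scores from two sources and normalize to a belief."""
    combined = {obj: gesture[obj] * language[obj] for obj in gesture}
    total = sum(combined.values())
    if total == 0:
        raise ValueError("the two sources agree on no candidate")
    return {obj: score / total for obj, score in combined.items()}


def choose_action(belief, confidence=0.8):
    """Greedy stand-in for planning: retrieve if confident, otherwise look again."""
    best = max(belief, key=belief.get)
    if belief[best] >= confidence:
        return ("retrieve", best)
    return ("reposition", None)


# Hypothetical scores for the spoken request "the red mug" plus a pointing gesture.
gesture = {"red_mug": 0.7, "blue_mug": 0.6, "book": 0.05}   # cone weights
language = {"red_mug": 0.9, "blue_mug": 0.2, "book": 0.05}  # VLM match scores
no_language = {obj: 1.0 for obj in gesture}                 # uninformative source

gesture_only = choose_action(fuse(gesture, no_language))
combined = choose_action(fuse(gesture, language))
print(gesture_only, combined)
```

With the gesture alone, the two mugs are nearly tied (the best belief is about 0.52), so the rule chooses to reposition. Adding the language score lifts the red mug to about 0.84, which clears the threshold and triggers retrieval.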
In experiments with a quadruped platform searching a messy lab area, the combined gesture-and-language approach located the correct object nearly 90% of the time — a substantial improvement over systems that use only vision or only language cues. The results demonstrate the practical value of multimodal signals for real-world human-robot interaction.
Co-author Jason Liu, who worked on the project while a Ph.D. student at Brown and is now a postdoctoral researcher at MIT, noted: “This framework brings us closer to assistants that understand the way people naturally communicate — through speech, eye gaze, gestures, and demonstrations.”
The research was supported by Brown’s AI Research Institute on Interaction for AI Assistants (ARIA), funded in part by the National Science Foundation.
Funding: Supported by the National Science Foundation (2433429), the Long-Term Autonomy for Ground and Aquatic Robotics program (GR5250131), and the Office of Naval Research (N0001424-1-2784, N0001424-1-2603).
Key Questions Answered:
A: Humans often give vague or imprecise signals: a casual point or a phrase like “that over there.” Dogs evolved to interpret those cues reliably. By modeling how people align gaze and pointing — the same cues dogs use — robots gain a practical way to infer intent in messy, real-world settings.
A: Not biologically. The robot uses a mathematical POMDP to represent uncertainty and make decisions. Dogs rely on instinct and experience; the robot uses probability estimates and active sensing to reduce ambiguity before committing to a choice.
A: The approach is applicable to practical tasks — from home assistance to hospital support — wherever robots must identify specific items amid clutter. Understanding natural human signals reduces the need for perfectly precise voice commands and enables more intuitive collaboration.
Editorial Notes:
- Article edited by a Neuroscience News editor.
- Journal paper reviewed in full.
- Additional context added by editorial staff.
About this AI and robotics research news
Author: Kevin Stacey
Contact: Kevin Stacey – Brown University
Original Research: Findings to be presented at the ACM/IEEE International Conference on Human-Robot Interaction (HRI).