Robot see, Robot do: Bots learn by watching human behaviour
Robots following coded instructions to complete a task? Old school. Robots learning to do things by watching how humans do it? That’s the future.
Stanford’s Animesh Garg and Marynel Vázquez recently shared their research in a talk on “Generalisable Autonomy for Robotic Mobility and Manipulation” at the GPU Technology Conference.
In lay terms, generalisable autonomy is the idea that a robot can observe human behaviour and learn to imitate it in a way that’s applicable to a variety of tasks and situations.
What kinds of situations? Learning to cook by watching YouTube videos, for one. And figuring out how to cross a crowded room for another.
Cooking 101
Garg, a postdoctoral researcher at the Stanford Vision and Learning Lab (CVGL), likes to cook. He also likes robots. But what he’s not so keen on is a future full of robots that can only cook one recipe each.
While the present is increasingly full of robots that excel at single tasks, Garg is working toward what he calls “the dream of general-purpose robots.”
The path to the dream may lie in neural task programming (NTP), a new approach to meta-learning. NTP leverages hierarchy and learns to program with a modular robot API to perform unseen tasks from only a single test example.
For instance, a robot chef would take a cooking video as input and use a hierarchical neural program to break the video data down into what Garg calls a structured representation of the task based on visual cues as well as temporal sequence.
Instead of learning a single recipe that’s only good for making spaghetti with meatballs, the robot understands all the subroutines, or components, that make up the task. From there, the budding mechanical chef can apply skills like boiling water, frying meatballs and simmering sauce to other situations.
Solving for task domains instead of task instances is at the heart of what Garg calls meta-learning. NTP has already seen promising results, with its structured, hierarchical approach leaving flat programming in the dust on unseen tasks, while performing equally well on seen tasks. Full technical details are available on the project’s GitHub.
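To make that hierarchy concrete, here is a minimal Python sketch of the control flow NTP relies on, under heavy assumptions: a hand-written task specification and lookup table stand in for the learned networks, and the primitive API names (move_to, grasp, pour) are illustrative, not Stanford’s actual robot interface.

```python
# Sketch of NTP-style control flow: a program expands into sub-programs,
# which bottom out in primitive robot API calls. The learned parts of the
# real system are replaced here by a hand-written spec, purely to show the
# "program calls sub-program calls API" structure.

# Hypothetical primitive robot API (names are illustrative only).
def move_to(obj):    print(f"move_to({obj})")
def grasp(obj):      print(f"grasp({obj})")
def pour(src, dst):  print(f"pour({src}, {dst})")

API = {"move_to": move_to, "grasp": grasp, "pour": pour}

# In NTP the task specification would be an embedding of the demonstration
# video; here it is just a nested description of sub-tasks.
SPAGHETTI_DEMO = {
    "make_dinner": ["boil_water", "cook_sauce"],
    "boil_water":  [("move_to", "pot"), ("grasp", "pot"), ("pour", "water", "pot")],
    "cook_sauce":  [("move_to", "pan"), ("pour", "sauce", "pan")],
}

def run_program(name, spec):
    """Recursively expand a task into sub-tasks until only API calls remain."""
    for step in spec[name]:
        if isinstance(step, tuple):      # leaf node: a primitive API call
            fn, *args = step
            API[fn](*args)
        else:                            # internal node: recurse into a sub-program
            run_program(step, spec)

run_program("make_dinner", SPAGHETTI_DEMO)
```

Because the subroutines (boil_water, cook_sauce) are separate from the top-level recipe, the same primitives can be recombined for a task the robot has never seen, which is the point of the hierarchical approach.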
Feeling Crowded? Follow the Robot
We’ve all been there. You’re trying to make your way through a crowded room, and suddenly find yourself face-to-face with a stranger coming from the opposite direction.
You move right to get around them, but they move the same way, blocking your path. Instinctively, you both move the other way. Blocked again!
One of you cracks a “Shall we dance?” joke to break the tension, and you finally manoeuvre past one another to continue.
Understanding how and why people move the way they do when walking through a crowded space can be tricky. Teaching a robot to understand these rules is daunting. Enter Vázquez and JackRabbot, CVGL’s social navigation robot.
JackRabbot first hit the sidewalks in 2015, making small deliveries and travelling at pedestrian speeds below five miles per hour. As Vázquez explained, teaching JackRabbot, named after the jackrabbits that also frequent the Stanford campus, is a vehicle for tackling the complex problem of predicting human motion in crowds.
Teaching an autonomous vehicle to move through unstructured spaces — for example, the real world — is a multifaceted problem. “Safety is the first priority,” Vázquez said. From there, the challenge quickly moves into predicting and responding to the movements of lots of people at once.
To tackle safety, the team turned to deep learning, developing a generative adversarial network (GAN) that compares real-time data from JackRabbot’s camera with images generated by the GAN on the fly.
These images represent what the robot should be seeing if an area is safe to pass through, like a hallway with no closed doors, stray furniture or people standing in the way. If reality matches the ideal, JackRabbot keeps moving. Otherwise, it hits the brakes.
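The check can be sketched in a few lines, assuming both frames arrive as normalised pixel arrays; the mean-pixel mismatch score and the threshold below are illustrative stand-ins for however the Stanford system actually compares the live and generated views.

```python
import numpy as np

def is_path_clear(camera_frame, expected_frame, threshold=0.15):
    """Keep driving if the live view matches the GAN's 'safe' view closely enough.

    Both frames are assumed to be HxWx3 float arrays in [0, 1]; the threshold
    is an illustrative number, not a value from the research.
    """
    mismatch = np.mean(np.abs(camera_frame - expected_frame))
    return mismatch < threshold

# Toy usage: identical frames pass; a frame with an unexpected bright region fails.
expected = np.zeros((64, 64, 3))
live = expected.copy()
print(is_path_clear(live, expected))   # True  -> keep moving

live[16:48, 16:48, :] = 1.0            # something unexpected blocks the view
print(is_path_clear(live, expected))   # False -> hit the brakes
```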
From there, the team turned to multi-target tracking, aka “Tracking the Untrackable.” Moving gracefully through a crowd goes beyond an immediate assessment of “Is my path clear?” to tracking multiple people moving in different directions and predicting where they’re headed next.
Here the team built a recurrent neural network using the long short-term memory approach to account for multiple cues — appearance, velocity, interaction and similarity — measured over time.
A published research paper delves into the technical nitty-gritty, but in essence, CVGL devised a novel approach that learns the common sense behaviours that people observe in crowded spaces, and then uses that understanding to predict “human trajectories” where each person is likely to go next.
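As a rough illustration of the idea, here is a minimal sketch in PyTorch (a framework assumption, not confirmed by the source) that runs a single LSTM over a per-person sequence of concatenated cue features and regresses the next position. The published model is richer, encoding each cue with its own recurrent network before fusing them, but the shape of the computation is the same.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Sketch: an LSTM over per-person cue features predicting the next (x, y) step.

    The cue vector is assumed to concatenate appearance, velocity, interaction
    and similarity features for one pedestrian at one time step.
    """
    def __init__(self, cue_dim=16, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(cue_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)   # predicted (x, y) displacement

    def forward(self, cue_sequence):
        # cue_sequence: (num_people, time_steps, cue_dim)
        outputs, _ = self.lstm(cue_sequence)
        return self.head(outputs[:, -1])       # predict from the final time step

# Usage sketch: 8 pedestrians observed for 12 frames with 16-dimensional cues.
model = TrajectoryPredictor()
cues = torch.randn(8, 12, 16)
print(model(cues).shape)                       # torch.Size([8, 2])
```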
So, the next time you find yourself headed for one of those awkward “Shall we dance?” moments in a room full of strangers, just remember to track multiple cues over time and project the motion trajectories of every human in sight.
Or take the easy way and find a JackRabbot to walk behind. Or better yet, the newly announced JackRabbot 2.0 (with dual NVIDIA GPUs onboard). It’ll know what to do.