Watch and Learn, forget the speech!
This is gonna be another one of those papers I came across while working on integrating LLMs with the Quadruped, and it was a rather interesting find. While the ideas are intuitively simple, the methodology definitely does something unique and out of the box.
I have a soul, but no body….
Providing instructions can be a bit tricky sometimes, mostly when we are not fully aware of the state or condition of the person we are trying to instruct. You could ask me how to boil an egg, and I would give you the steps, but that would also require me to know the resources you are working with and your immediate surroundings, or else the recipe will fall short. This applies even more strongly to Large Language Models breaking down complex actions for robots: since LLMs lack any physical connection to the real world, their suggestions on how to do a particular task may come out very generalized or may turn out to be impossible to carry out (e.g. asking a robot to turn the AC on when the room only has fans). So this paper approaches the problem by, in a sense, embodying the LLM through a robot, to provide a better seed context for the LLM to produce informed actions.
But the way they have proposed this embodiment is also rather interesting: each action that the robot can perform is given a certain ‘affordance’ value, which is nothing but the probability that, given the current state, the action will result in progress on the task. This score at each step makes the LLM aware of the robot’s ability to perform a certain sub-action. Hence the name ‘SayCan’, in which the ‘Say’ part breaks a high-level task into a sequence of sub-actions that can be performed to the task’s completion, while the ‘Can’ part is a learned affordance function that dictates the possibility of carrying out those actions at every step.
Getting the preliminaries right
Let’s talk about the components involved here. The first obvious one is the LLM, and to just scratch the surface of what LLMs do: they generate the next token in a sequence based on probability scores obtained via the ever-popular attention mechanism. This forms the ‘Say’ part of the whole pipeline. The next step is to be able to rank actions given the robot’s current state, and the go-to answer for that is reinforcement learning, of a special kind at that: Temporal Difference (TD) reinforcement learning. Here, things are similar to your run-of-the-mill Markov Decision Process in terms of the reward function; the difference is in how the value estimates are updated. In Monte Carlo style RL the update only happens at the end of an episode, while in TD we sort of guess the future rewards we might get, add them to the current reward, and update our estimate at every step. This seems like the right choice given that the value of the current action heavily depends on what comes after it; for instance, you won’t decide on going to the museum tomorrow if it’s a public holiday and the museum is likely to be closed. Overall, the TD update looks something like:
\(V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]\) Here, $V(\cdot)$ is the value function that represents how ‘good’ a particular state is, $r_{t+1}$ is the reward received for the transition out of $s_t$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor, which basically decides how much weight to give the estimated future value. This update happens at every step, unlike the Monte Carlo method, where:
\(V(s_t) \leftarrow V(s_t) + \alpha \big[ G_t - V(s_t) \big]\) is only applied once the episode has ended, where $G_t$ is the return: the cumulative (discounted) sum of the rewards observed from step $t$ onwards.
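To make the contrast concrete, here is a tiny runnable sketch of the two updates on a made-up three-state episode; the states, rewards, learning rate, and discount factor are all illustrative, not taken from the paper.

```python
# Toy comparison of TD(0) and Monte Carlo value updates.
alpha, gamma = 0.1, 0.9
states = ["s0", "s1", "s2", "terminal"]
V_td = {s: 0.0 for s in states}
V_mc = {s: 0.0 for s in states}

# One toy episode: (state, reward received on leaving it, next state).
episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "terminal")]

# TD(0): update after every step, bootstrapping from the current estimate V(s_{t+1}).
for s, r, s_next in episode:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# Monte Carlo: wait for the episode to end, then move each state's value
# towards the actual return G_t observed from that point onwards.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])

print(V_td, V_mc)
```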
The Problem Statement
Alright so we need the following things:
- Given a complete task description, we want the LLM to break it down into sub-actions.
- The sub-actions also have a few constraints:
  - The robot should be able to follow them.
  - The actions should lead to the successful completion of the task.
Now this is carried out in two steps:
- Every low-level action that the robot can directly perform is given a textual label $l$. From the high-level instruction, the LLM tries to generate a sequence of steps, which are basically the textual labels of the robot’s actions. The LLM also generates a probability value for each textual label, which dictates the validity of that action given the current state.
  - e.g. consider the task of getting a tool from the other end of the room, with the robot currently in the centre of the room. Here, the action of moving towards the tool would be given a higher probability by the LLM.
- Regardless of how favorable a given action looks in the current state, we also need to consider whether choosing that particular action will actually lead to task completion, which is modelled by a Bernoulli random variable (a Bernoulli variable is a yes-or-no random variable that tells us whether the task is completed or not).
So given a task, the SayCan architecture combines the probability of each skill being relevant to the task (from the LLM) with the probability, from an affordance function, that the skill can actually be carried out successfully from the current state, and it takes the product of the two to decide which skill to execute.
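To make the combination concrete, here is a minimal self-contained sketch of that decision rule. The skill names, numbers, and the two helper functions (`llm_score`, `affordance`) are made-up placeholders for illustration; in the actual system the first comes from LLM token likelihoods and the second from a learned value function.

```python
def llm_score(instruction: str, skill: str) -> float:
    # Placeholder for the 'Say' part: how relevant the LLM thinks
    # this skill text is as the next step for the instruction.
    scores = {
        "find a sponge": 0.5,
        "pick up the sponge": 0.3,
        "go to the table": 0.15,
        "done": 0.05,
    }
    return scores.get(skill, 0.0)

def affordance(state: str, skill: str) -> float:
    # Placeholder for the 'Can' part: a value function interpreted as
    # the probability that the skill succeeds from the current state.
    values = {
        ("at_counter", "find a sponge"): 0.9,
        ("at_counter", "pick up the sponge"): 0.2,
        ("at_counter", "go to the table"): 0.8,
        ("at_counter", "done"): 0.1,
    }
    return values.get((state, skill), 0.0)

def select_skill(instruction: str, state: str, skills: list) -> str:
    # SayCan multiplies the two probabilities and picks the best skill.
    return max(skills, key=lambda s: llm_score(instruction, s) * affordance(state, s))

skills = ["find a sponge", "pick up the sponge", "go to the table", "done"]
print(select_skill("bring me a sponge", "at_counter", skills))
```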
Bringing ideas to Reality: Robotics
Now, even though it’s the LLM’s job to come up with doable sub-tasks for the robot, it must know which tasks the agent can actually do. So we provide the LLM with a set of actions, their language descriptions, and a value function that calculates the ‘affordance’ of each action, because the model needs a place to start. Now comes the question of how we train the agent to perform these skills and how we obtain the affordances of those skills.
Two methods were employed for the same:
- Imitation Learning - The agents are exposed to multiple instances of tasks being performed and learn effective low-level policies from them; their performance is then evaluated by human raters, who decide whether the task was completed successfully or not.
- Multi-Task Learning - A suite of tasks (handling and manipulating objects, movement) is performed and the robot learns common functionality across those tasks. So instead of training policies for single tasks, they train multi-task policies, i.e. a policy that can solve more than one task. (Jargon alert: a policy is nothing but something that translates real-world inputs into an action.)
So the crux of the matter is that the LLM breaks the instruction down into simpler sub-tasks. The LLM generates the language description of each sub-task, a sentence encoder is then used to generate an embedding for each description, and these embeddings are fed to the policy and the value function, which jointly output the action best suited to that scenario.
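As a rough illustration of that embedding step (not the paper’s actual pipeline), here is how sub-task descriptions could be turned into fixed-size vectors with an off-the-shelf sentence encoder; the specific `sentence-transformers` model is purely an assumption for the sketch.

```python
# Encode sub-task descriptions into vectors that a language-conditioned
# policy / value network could consume alongside the robot's observation.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder for illustration

sub_tasks = [
    "go to the counter",
    "pick up the sponge",
    "bring the sponge to the table",
]
embeddings = encoder.encode(sub_tasks)  # numpy array of shape (3, 384)

# Each embedding would then be fed, together with the current observation,
# to the policy and value function networks.
print(embeddings.shape)
```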
Training the Low-Level Skills
Just like any robotics problem, we start by training the agents in simulation, where the underlying Markov Decision Process consists of a reward function as well as the skill specifications used by the model. But in order to expand on this idea, train the agents at scale in the real world, and learn an RL policy that adapts well to language, they used MT-OPT in the Everyday Robots simulator with RetinaGAN sim-to-real transfer. The action space of the policies includes the six degrees of freedom of the end-effector pose as well as gripper open and close commands, the x-y position and yaw orientation delta of the robot’s mobile base, and a terminate action.
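To visualize that action space, here is a rough sketch of it as a Python dataclass; the field names and types are my own guesses for illustration, not the paper’s actual interface.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SayCanAction:
    ee_pose_delta: np.ndarray   # 6-DoF change in end-effector pose (xyz + roll/pitch/yaw)
    gripper_open: bool          # gripper open/close command
    base_xy_delta: np.ndarray   # x-y displacement of the mobile base
    base_yaw_delta: float       # yaw rotation delta of the mobile base
    terminate: bool             # flag that ends the current skill

# e.g. a small forward nudge of the base with the gripper kept open
a = SayCanAction(np.zeros(6), True, np.array([0.1, 0.0]), 0.0, False)
```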
A wide variety of skills was chosen for the robotic arms to be trained on, most of them skills that would come up in a kitchen environment.
Another advantage this methodology carries is freedom of choice when it comes to the LLMs used for the various tasks. The architecture is very modular: we can use one LLM for breaking down the higher-level instructions into lower-level ones (planning) and a different one for obtaining the embeddings of each sentence. SayCan proposes a unique approach to mapping high-level tasks into actions that are feasible, viable, and appropriate for a robotic agent in a given state.