Your dog can sniff, mine scans...
The title of this blog is kind of misleading, and I apologize for that. This is not about comparing how our Quadruped at Project MANAS, a.k.a. Sam, "scans" versus how your dog "sniffs"; it is more about using that distinctive mechanical and electrical ability to improve its intelligence. And unless you have been living under a rock, you would know that the hottest intelligent systems in 'da club right now are LLMs, so I thought to myself: what if we could integrate LLMs with the Quadruped and see how it improves its capacity to undertake various tasks?
Well, it’s not that dumb to be honest
The Quadruped procurement for Project MANAS was actually the brainchild of our seniors and super-seniors who had interned at IISc Bangalore's robotics laboratory. IISc Bangalore being the powerhouse that it is, the Quadruped was one of the many projects being worked on there. So why not have our own test bench that lets us learn about these systems, while at the same time developing and researching in an exciting domain?
So it’s quite smart for a mechanical dog I would say, autonomous navigation is not it’s best suit I’d say, but it does manage to navigate to straight goals, it can follow certain hard-coded voice commands. But what really bugs me is that while you can ask Sam to Wiggle it’s butt and show you a dance, you can’t ask it to simply bring you a document that your friend wants you to have a look at. And I was like well, why not incorporate that system into it then? So I turned to the most intelligent(Supposedly) systems of today - LLMs.
When in doubt, go to the library arXiv
Okay, so since we have just begun this venture, like any good research project I put my head out to sniff whether people have already done something similar, and voilà, did I find it. Every major corporation from Google to NVIDIA (I was kinda sad I couldn't read a Boston Dynamics paper, but okay) is heavily invested in this and has been trying to incorporate LLMs into robotic agents to make them smarter. In today's blog, I will briefly break down 5 such papers for you, listing out some key points along with what they achieved and what they missed out on. And of course, just like anything on the internet today, this list is incomplete without Google.
Do As I Can, Not As I Say…
Well, if you thought I was trying to be funny with the above heading, I WASN'T; it's the actual title of an actual paper by people at Google and Everyday Robots. Computer science researchers seem to be the most creative people when it comes to naming their work. Anyway, what bells does the title ring when you think of it in terms of LLMs and robotics? Okay, I'll let a picture from the paper explain it to you:
Now imagine if, instead of giving instructions to the robotic arm, the LLM was instructing you (arghhh, they'll rule us). Regardless of the order in which the LLM gives the instructions, or their content, we have a natural intuition that tells us things such as: the sponge has to be picked up before going to the trash can, or the cup is still there, so I need to remove it before cleaning the area. In other words, we carry out the instructions "as we can", not "as we were told". This highlights a crucial point: no matter how smart LLMs are, they and their responses are not grounded in the real world, solely because they have never interacted with the actual environment.
This is what the SayCan architecture proposed in the paper addresses, as depicted in the picture above. It does so by treating the feedback from the robot as tokens, essentially making the robot the limbs of the LLM, which lets it interact with and gain feedback from the surrounding world.
Work encompasses:
- Leveraging the rich semantic knowledge in LLMs to complete real-world tasks.
- It does so by treating all the possible things the robot can do as "skills"; e.g. lifting an item or navigating to a location are "skills" in this context.
- For any scenario that the architecture needs to plan for, each skill is assigned a certain "value" reflecting how appropriate that skill would be in that scenario. Imagine trying to build a website: out of all the dev skills in your bag, certain skills like React or DBMS would be handy here and hence would be assigned higher values by the system (a tiny sketch of this scoring idea follows this list).
- Each skill is given a text label so that it can be easily parsed by the language model and then logically arranged to form a cohesive response that is well grounded in environmental constraints and achievable.
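To make the combination concrete, here is a minimal, purely illustrative sketch of the SayCan-style scoring idea: the LLM scores how useful a skill is for the instruction, a value function scores how feasible it is from the current state, and the product picks the next skill. The skill names, scores and the toy value function below are my own stand-ins, not the paper's code.

```python
# Minimal sketch of SayCan-style skill selection (illustrative stand-ins, not the paper's code).

def llm_score(instruction: str, skill: str) -> float:
    """Stand-in for the LLM's likelihood that `skill` is a useful next step.
    In the real system this comes from the language model's token probabilities."""
    toy_scores = {
        "pick up the sponge": 0.70,
        "go to the trash can": 0.20,
        "pick up the cup": 0.10,
    }
    return toy_scores.get(skill, 0.01)

def value_score(skill: str, state: dict) -> float:
    """Stand-in for the learned value/affordance function: how likely the skill
    is to succeed given the robot's current state."""
    if skill == "pick up the sponge" and state.get("sponge_visible"):
        return 0.9
    if skill == "go to the trash can":
        return 0.8
    return 0.1

def pick_next_skill(instruction: str, skills: list, state: dict) -> str:
    # SayCan-style combination: usefulness (LLM) x feasibility (value function).
    scored = {s: llm_score(instruction, s) * value_score(s, state) for s in skills}
    return max(scored, key=scored.get)

state = {"sponge_visible": True}
skills = ["pick up the sponge", "go to the trash can", "pick up the cup"]
print(pick_next_skill("throw away the sponge", skills, state))  # -> "pick up the sponge"
```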
Research Gaps:
- Since the paper is quite old and the LLMs used were relatively small (in terms of parameter count), their abilities were limited by the training data.
- The range of skills the agent has also poses a bottleneck (for instance, a simple fixed robotic arm can be dexterous but fail to follow navigation commands).
- The paper does not explore the wider space of robot planning with language, or other ways of combining language with robotic control.
Can you just break it down for me?
Well, the above heading is my go-to prompt these days for those chatbots whenever I feel stuck in a mess of jargon and buzzwords. Oh wait! That makes me think: what if all our instructions are nothing but a huge and confusing web of jargon and buzzwords to an agent? For instance, when I ask Sam to "walk over to the guy wearing the red shirt with this doc", could it be that it gets lost in the multitude of possible actions it can take, but can't decide for sure just because the action is not really clear to it? That is where the paper "Language Models as Zero-Shot Planners" comes in: it leverages LLMs to break down high-level tasks into a bunch of smaller, actionable steps (i.e. actions closer to the "skills" described in the SayCan paper).
Work encompasses:
- Investigates the possibility of breaking down high-level tasks into smaller actionable steps.
- Instead of learning the mapping via step-by-step demonstrations of "how to act", it relies on the semantic information inside LLMs to extract that mapping.
- But don't be misled: even though they do not rely on command-action pairs to learn the mappings, the LLMs are shown the mapping between textual representations of high-level and low-level commands. It's like telling a child: "Go play football" means "Put your shoes on, step out of the house, walk towards the field, play." (Gosh, kids these days….)
- This preconditioning improves the performance over LLM baselines (a rough sketch of the decompose-and-match loop follows this list).
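Here is a rough, dependency-free sketch of that loop under my own simplifications: an LLM decomposes the task into free-form steps (hard-coded below as a stand-in), and each step is then matched to the closest admissible action the agent actually knows. The real paper uses sentence embeddings for this matching; plain string similarity stands in here just to keep the example self-contained.

```python
# Sketch of the zero-shot planning loop (my simplification, not the paper's code).
from difflib import SequenceMatcher

# Hypothetical set of actions the agent is actually able to execute.
ADMISSIBLE_ACTIONS = ["walk to kitchen", "grab vacuum cleaner", "open drawer", "wipe table"]

def closest_admissible(generated_step: str) -> str:
    """Map a free-form step from the LLM onto the nearest admissible action.
    The paper uses sentence-embedding similarity; string similarity stands in here."""
    return max(
        ADMISSIBLE_ACTIONS,
        key=lambda a: SequenceMatcher(None, generated_step.lower(), a.lower()).ratio(),
    )

def plan(high_level_task: str) -> list:
    # Stand-in for the LLM's free-form decomposition of the task.
    llm_steps = {
        "clean the room": ["go to the kitchen", "pick up the vacuum", "wipe the table"],
    }
    return [closest_admissible(step) for step in llm_steps.get(high_level_task.lower(), [])]

print(plan("Clean the room"))  # e.g. ['walk to kitchen', 'grab vacuum cleaner', 'wipe table']
```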
Research Gaps:
- Even though the LLMs can output logically correct commands, the absence of feedback from the environment reduces the chances of the smaller actions actually being executable. For instance, for the high-level command "Clean the room", "grab the vacuum cleaner" might not be an executable step because the environment does not have one.
- Reliance on the belief that the simpler tasks generated by the LLM (i.e. mid-level actions) are already doable by the lower-level sensorimotor skills, and that these lower-level actions do not need any instructions from the LLM.
The power of the sun, in the PaLM-E of my hand…
Well, if you recognized the above dialogue to be from Spider-Man, kudos buddy, both of us are Spidey lovers! But if instead you recognized Google's LLM PaLM in there, hats off ~ ~. See, one thing is sort of obvious by now: LLMs, regardless of how good they are, struggle to act under the constraints of the actual environment, i.e. they are not grounded. Imagine them as a super smart frog in a well who knows everything, yet struggles to make completely accurate decisions about being in a savannah simply because he has never been there. If you want to understand the work done by the authors of PaLM-E, you can imagine giving the frog extended sensors that sort of tell him the conditions in the savannah. Their LLM takes sensor data, the state of the robot and textual input all as "sentences", of course with the relevant encoding. And it made some decent leaps, such as being able to do sequential planning from textual input. However, the wise reader would look behind the glam and see that, being from Google, the model is HUGEEEE, I mean 562B parameters. So it's sort of like saying GPT-3.5 was better than GPT-3 because it had more parameters. Of course that is not entirely the case here, as the way the model is trained also matters, but it'd be interesting to see if lighter adaptations incorporating multiple sensors and sources exist. However, some people may think about the parameter counts of the chatbots we've come to know and love, and question whether 562B is really that huge…
Work encompasses:
- Injects continuous sensory data from the robot into the LLM, meaning the embedding space is multi-modal (basically, data of different types is encoded to live in the same space where it can be used by the LLM); a toy illustration of this "multimodal sentence" idea follows this list.
- Capable of performing sequential manipulation tasks and does not need another LLM to break the task down into simpler ones. For instance, asked to "make a cake batter from all the ingredients you see", it successfully decodes the sub-steps involved in the process (crack egg, put egg in bowl, etc.).
- Similar to the previous papers, it relies on the fact that the robot is well-versed in low-level policies, i.e. the LLM won't break the action "crack egg" down further into joint or motor commands. The "skills" of basic movement and handling are already learnt by the agent, and PaLM-E is only concerned with providing the high-level commands in an autoregressive (one token at a time, based on the past sequence) manner.
- Training on large-scale vision-language datasets allows the model to be highly accurate on embodied tasks, i.e. tasks which require awareness of your surroundings, given that you are able to see them if not feel them.
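To make the "everything becomes a sentence" point concrete, here is a toy sketch of how a multimodal sequence could be assembled. The encoders, the embedding dimension and the number of visual tokens are all made up for illustration; the real PaLM-E uses a ViT-style image encoder and a 562B-parameter language model.

```python
# Toy illustration of a PaLM-E-style "multimodal sentence" (all shapes/encoders are hypothetical).
import numpy as np

EMB_DIM = 8  # tiny for illustration; real embedding dimensions are in the thousands

def embed_text(tokens):
    # Stand-in for the LLM's word-embedding lookup.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), EMB_DIM))

def embed_image(image):
    # Stand-in for a vision encoder projecting image features into the LLM's embedding space.
    rng = np.random.default_rng(1)
    return rng.normal(size=(4, EMB_DIM))  # pretend the image becomes 4 visual "tokens"

def embed_robot_state(joint_angles):
    # Stand-in for projecting proprioception (joint angles etc.) into the same space.
    rng = np.random.default_rng(2)
    return rng.normal(size=(1, EMB_DIM))

# Interleave the modalities into one sequence the decoder attends over, just like a sentence.
prompt_tokens = "Q: how do I pick up the block? A:".split()
sequence = np.concatenate(
    [embed_image(np.zeros((64, 64, 3))),   # visual tokens
     embed_robot_state(np.zeros(12)),      # robot-state token
     embed_text(prompt_tokens)],           # text tokens
    axis=0,
)
print(sequence.shape)  # (4 visual + 1 state + 9 text tokens, EMB_DIM) -> (14, 8)
```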
Research Gaps:
- The parameter count is a big red flag that throws any question of practical deployment out the window.
- Again, similar to the previous papers, it relies on the belief that the simpler tasks generated by the LLM (i.e. mid-level actions) are already doable by the lower-level sensorimotor skills, and that these lower-level actions do not need any instructions from the LLM.
- Performance with noisy sensor data across different environmental settings needs deeper evaluation.
What if you gave me everything in one place?
Well, if you noticed something in the papers so far, it is that all of them generate a bunch of instructions for the agent to follow from the prompt that is given to them. So the structure looks something like this: a language command, which the LLM parses to generate a symbolic plan, and then certain mid-level commands that the agent can follow based on its "skills" or low-level sensorimotor movements. There seems to be a slight disconnect between the prompt to the LLM and the final actions performed by the agent, and you rely on the robot's ability to correctly break down the tasks into the right set of low-level commands. This gap is bridged by QUAR-VLA, which treats everything from the initial command to the final actions that the agent receives as a single pipeline.
So it is somewhat like the model actually providing explicit commands to the agent to carry out the actions that would help it achieve the high-level semantic task. These can look like "rotate the hip joint by 30 degrees", or "actuate the knee joint", etc.
Work encompasses:
- Proposes a new paradigm that integrates the vision input from the Quadruped's sensors, the language instruction, and action generation (a hand-wavy sketch of this single-pipeline idea follows this list).
- The interesting factor here was that this was not just limited to carrying out tasks, but also extended to navigation and manipulation, which require a much more complex understanding of the environment. For instance, if I want Sam to navigate a room towards a red target, it must first be able to detect the target and then move towards it.
- They also compiled all of the collected sensor data into a single dataset (QUARD) which can be utilised for end-to-end training of other robotic agents.
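Here is a hand-wavy sketch of the single-pipeline idea: one policy maps the camera frame plus the language instruction straight to low-level commands for the quadruped. The interface, the command fields and the "walk to the red target" rule below are all my own assumptions for illustration, not the paper's actual model or API.

```python
# Illustrative sketch of an end-to-end vision-language-action policy (not the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class QuadrupedCommand:
    forward_velocity: float  # m/s
    yaw_rate: float          # rad/s
    body_height: float       # m

def vla_policy(image: np.ndarray, instruction: str) -> QuadrupedCommand:
    """Stand-in for a trained vision-language-action model.
    A real policy would run a neural network; here a hard-coded 'red target' rule fakes it."""
    red_mask = (image[..., 0] > 200) & (image[..., 1] < 80) & (image[..., 2] < 80)
    if "red" in instruction.lower() and red_mask.any():
        # Steer towards the horizontal centroid of the red pixels.
        cx = red_mask.nonzero()[1].mean() / image.shape[1]  # 0 (left edge) .. 1 (right edge)
        return QuadrupedCommand(forward_velocity=0.4, yaw_rate=0.5 * (0.5 - cx), body_height=0.3)
    return QuadrupedCommand(forward_velocity=0.0, yaw_rate=0.0, body_height=0.3)

frame = np.zeros((120, 160, 3), dtype=np.uint8)
frame[50:70, 100:120] = [255, 0, 0]  # fake red target on the right-hand side of the frame
print(vla_policy(frame, "walk to the red target"))  # steers towards the target while walking
```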
Research Gaps:
- The simulation environment can be improved: more terrains, with different configurations.
- They don't seem to have tested different path-planning algorithms extensively, and for control they only use a PD controller; maybe we can look into its limitations and improve on it.
- Also, while they have shown extensive results in simulation, the results in the actual environment might need further evaluation.
- Latency remains an issue; the response time can be improved.
Our Tea for the second time…picked by a bot!
The RT-2 paper does something extra, in the sense that the techniques adopted by the authors produced policies good enough to let the robots perform "unseen" tasks. The key to this was the training data employed by the authors: they not only trained on robot demonstration data (i.e. command-action pairs) but also on extensive data from the web, which probably explains the ability to carry out unseen tasks. This work also leverages the transformer architecture, which means it is a bit memory-intensive. Now, if you are someone who is not familiar with the transformer architecture, hang tight, an explanation for that will be up soon here :) The usage of the transformer architecture means that the inputs and the outputs are handled as tokens (just like how GPT or your favourite LLM writes in "bits" sequentially; that bit-styled writing is a result of tokenization).
Work encompasses:
- Training on robotic demonstrations (sample robotic actions) as well as vast internet data.
- From the prompt, the action commands are given out as discrete tokens (because transformers), which means that continuous actions (like "rotate 30 degrees while moving ahead") are emitted as sequences of discrete tokens; a small sketch of this discretization follows this list.
- Shows strong performance on semantic reasoning tasks.
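Here is a small sketch of what discretizing continuous actions into tokens can look like. The bin count, the action dimensions and the ranges are assumptions on my part; the point is simply that each continuous value is mapped to a bin index that the model can emit as an ordinary token.

```python
# Sketch of RT-2-style action tokenization (bin count and action ranges are assumed).
import numpy as np

N_BINS = 256  # assumed number of bins per action dimension

def action_to_tokens(action, low, high):
    """Discretize a continuous action vector into per-dimension bin indices (the 'tokens')."""
    normalized = (action - low) / (high - low)                      # scale to 0..1
    bins = np.clip(np.round(normalized * (N_BINS - 1)), 0, N_BINS - 1)
    return bins.astype(int).tolist()

def tokens_to_action(tokens, low, high):
    """Invert the discretization when the robot executes the predicted tokens."""
    return low + (np.array(tokens) / (N_BINS - 1)) * (high - low)

# Hypothetical 2-D action: forward velocity (m/s) and yaw rate (rad/s).
low, high = np.array([-1.0, -0.5]), np.array([1.0, 0.5])
tokens = action_to_tokens(np.array([0.3, -0.1]), low, high)
print(tokens)                               # e.g. [166, 102]
print(tokens_to_action(tokens, low, high))  # approximately [0.3, -0.1] again
```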
Research Gaps:
- Low-level control has not been addressed properly in this architecture either; it is left to the agent to interpret the mid-level commands accurately and execute the right low-level sensorimotor commands.
- Well, as awesome as the idea of training on the entire web sounds, it is a difficult mapping from the wide array of meanings one can extract from the internet's various knowledge bases to the actual grounding we would expect.
- Besides, training models on such large datasets requires extensive computing resources, along with the need to complement the training with simulation data.
Phew, that was quite a lot of unpacking, don't you think? But what is awe-inspiring is the speed with which progress is being made in this field; it's almost always buzzing with some activity or the other. Now, with a basic literature review underway, I'll start setting up the environment for testing out the LLM-Quadruped integration, so stick around for more updates, they are just around the corner ;)