Lights❌ Language✅ Camera...Action

If you are here from the last one, you are probably still reeling from all the researchy stuff we navigated, or rather skimmed through, back there. But hey, you never really understand ideas completely unless you do a deep dive, and that is what we are going to do here: a full, in-depth review of the QUAR-VLA paper we discussed in the last one. So welcome to another article where you lie back with your popcorn and let me do the heavy lifting of breaking down the complex jargon for you.

Well, it’s not that dumb to be honest

If you noticed, one major criticism of the papers discussed in the last blog was that they mapped high-level commands to a bunch of small tasks, not to low-level motor or sensor commands. This leaves the headache of inferring the low-level actions from the high-level commands to the action manager, which then also needs to be trained on what exactly particular “words” or “commands” map to in the action space. This is partly addressed by the QUAR-VLA paper, which deals directly in the vision-language-action space; the authors have also developed a specific VLA model for the same, along with a multi-task dataset. But first, let us explore their methodology and approach in depth, shall we?


Vision Language Action Models

An important thing to understand here is that if we are planning to build a model capable of producing motor-level commands from vision and language, we also need to look after the frequency at which those commands need to be sent. In the case of a quadruped, the actions that we output can neither be as simplistic as the velocity commands given by the nav planners, nor require a frequency as high as the motor-level commands. Hence, having a dataset that bridges this gap is also of prime importance, and this is one of the things the authors have aimed at; QUAR-VLA seems to be the first architecture that integrates vision and language together to generate quadruped actions.

To this end, we come to talk about Vision-Language-Action models that integrate the visual input from the various sensors on the quadruped with the language commands. The end goal, more or less, is to train a conditional policy, QUART, that can interpret RGB images and high-level commands. This policy takes RGB images and the instructions as input and produces actions as output. In their paper, the authors specifically aimed at an 11-dimensional action space, and each action command looks something like:

\[\{\, v_x, v_y, \omega_z, \theta_1, \theta_2, \theta_3, f, h_z, \phi, s_y, h_z^{f} \,\}\]

Here, \(v_x\) and \(v_y\) represent the linear velocities along the x-axis and y-axis, and \(\omega_z\) is the angular velocity (yaw rate) about the z-axis. \(\theta_1\), \(\theta_2\), and \(\theta_3\) indicate the gait pattern, \(f\) denotes the stepping frequency, \(h_z\) represents the body height of the robot, \(\phi\) denotes the pitch angle, \(s_y\) corresponds to the foot width, \(h_z^{f}\) represents the foot height, and \(t\) indicates the termination signal of the action.
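To make that structure a bit more concrete, here is a minimal sketch of such a command as a Python dataclass. The field names, units, and example values are my own illustration of the symbols above, not code or conventions from the paper.

```python
from dataclasses import dataclass

# A hypothetical container for one 11-dimensional QUART-style action command.
# Field names mirror the symbols in the equation above; units are assumptions.

@dataclass
class QuadrupedAction:
    v_x: float        # linear velocity along x (m/s)
    v_y: float        # linear velocity along y (m/s)
    omega_z: float    # angular velocity about z, i.e. yaw rate (rad/s)
    theta_1: float    # gait pattern parameters
    theta_2: float
    theta_3: float
    frequency: float  # stepping frequency (Hz)
    h_z: float        # body height (m)
    phi: float        # pitch angle (rad)
    s_y: float        # foot width (m)
    h_z_f: float      # foot swing height (m)

# Example: a slow forward walk with a default posture (made-up values)
action = QuadrupedAction(0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 3.0, 0.3, 0.0, 0.1, 0.08)
```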

Generating a discrete action space

So an action policy is basically supposed to take in the image and language input and generate action commands in this 11-dimensional action space. Now, for each possible action (velocity, rotation angle, etc.) there is a continuous range of values with a lower and an upper bound, which makes it very hard for the policy to reliably learn a set of actions for a particular input. This does not mean that we keep a set of discrete values, or else that would defeat the purpose of having a smooth action space as output. Discretization here means obtaining a set of bins for each possible action so that we can reliably say, okay, “velocity 5 m/s forward with rotation 30 degrees falls into this bin, choose a continuous value from that”, effectively narrowing down the search space and “discretizing” our model’s outputs.

  • Each action has a continuous domain with a lower and an upper limit; this interval is divided into 256 bins of equal width, where width = (upper − lower)/256.
  • For any given target value, the index of the bin it falls into is the greatest integer (floor) of (value − lower bound)/bin width, as sketched in the code below.
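Here is a minimal sketch of that binning scheme in Python; the bounds used in the example are made-up placeholders, not the paper’s actual action limits.

```python
import math

def discretize(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous action value to the index of its equal-width bin."""
    width = (high - low) / n_bins
    idx = math.floor((value - low) / width)
    return min(max(idx, 0), n_bins - 1)   # clamp to valid bin indices

def undiscretize(idx: int, low: float, high: float, n_bins: int = 256) -> float:
    """Recover a continuous value as the centre of the chosen bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width

# Example: a forward velocity of 0.8 m/s with assumed bounds of [-1.0, 1.0] m/s
bin_idx = discretize(0.8, low=-1.0, high=1.0)          # -> 230
print(bin_idx, undiscretize(bin_idx, -1.0, 1.0))       # -> 230 0.80078125
```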

Alright, so given this action space, multiple sets of tasks were tried, each in a different domain - perception, navigation, manipulation, etc. - and each with a varying level of difficulty. Some involved simple object detection and identification, while others involved the more complex job of identifying, manipulating, and moving objects from one place to another, and an obvious trend was observed: the more complex tasks required a larger number of episodes than the relatively simpler ones.

But what was more interesting were the insights on the relationships between the assigned tasks:

  • Perception emerged as a foundational task, underpinning all the others
  • Basic Navigation came next
  • As an extension of the above two fundamental tasks arose the remaining four - Spatial Navigation, Environment Adaptation, Obstacle Avoidance, and Object Manipulation.

The following picture from the paper does a nice job of illustrating the same:

[Figure from the QUAR-VLA paper: the task hierarchy described above]

Devil in the Details

I would like to list a few key aspects of the authors’ approach, since I am working on my own as well. If you have stuck around till this part of the blog, that’s terrific - let’s take note of a key constraint involved:

Consistency Constraints - Sim-to-real is one of the hardest problems out there in general. Forget about training a fancy quadruped: transferring policies from simulation to real life, even for something as small as a robotic arm, involves a lot of quirks, and the sheer number of variables involved can be overwhelming. Hence, to maintain consistency between the simulation environment and the real world, the authors took a few key steps:

  • The starting position of the robot was always the origin, with the target randomly positioned within a fixed range of coordinates.
  • The data collection setup adheres to a predefined language template in which the task, the object, the speed, and the gait are pre-specified (a toy version is sketched below).
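As a toy illustration of what such a template might look like, here is a small Python sketch; the template string and the vocabularies are my own guesses at the structure, not taken from the paper’s dataset.

```python
import random

# A hypothetical language template in the spirit of the data collection setup:
# the task, object, speed, and gait slots are filled from fixed vocabularies.

TEMPLATE = "{task} the {obj} at a {speed} pace using a {gait} gait"

TASKS   = ["go to", "avoid", "push"]
OBJECTS = ["red cube", "yellow ball", "blue cone"]
SPEEDS  = ["slow", "normal", "fast"]
GAITS   = ["trotting", "bounding", "pacing"]

def sample_instruction() -> str:
    """Draw one instruction by filling the fixed template from the vocabularies."""
    return TEMPLATE.format(
        task=random.choice(TASKS),
        obj=random.choice(OBJECTS),
        speed=random.choice(SPEEDS),
        gait=random.choice(GAITS),
    )

print(sample_instruction())  # e.g. "go to the red cube at a slow pace using a trotting gait"
```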


