Pictures of the Unseen...
I was recently pulled into the world of 3D reconstruction while interning this summer at an awesome lab, and trust me, the possibilities in this space are just out of this world. I am a huge fan of game engines and high-fidelity animation, but what was even more exhilarating was understanding some of the basic principles behind these techniques and then using them to bring 2D images to life. In this blog, we will discuss the various methodologies I used for 3D reconstruction, as well as an exciting problem we have taken up recently. So let’s get into it.
The Great Gauss to the rescue
What are the ways in which you could represent 2D images in 3D? Point clouds, or maybe structural sheets wrapped around a rough skeleton of the model. We will get to point clouds and meshes later in the blog, but first I want to talk about something unique and awe-striking - Gaussian Splatting. Now I don’t know about you, but the sheer number of places where the name of Carl Friedrich Gauss pops up just blows my mind. No seriously, look it up, that dude is awesome.
Okay, let’s understand the basic idea first - what you have is a set of 3D blobs which are either initialized randomly or from a sparse reconstruction of the scene built from the images we have (there is a step prior to this called SfM, Structure from Motion, which I’m glossing over here; it basically involves matching image features across a given set of images and recovering the camera poses).
These blobs, or Gaussians (hail Gauss), have the properties of a “volumetric radiance field”, which basically means that each blob models what color and brightness appear at a point in 3D space when it is viewed from a given angle (which, if you know the area, sounds very similar to NeRF, Neural Radiance Fields, concept-wise, but the two differ in how the structure is represented). Specifically, a blob looks something like this:

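To make that concrete, here is a tiny numpy sketch (purely illustrative - the function and variable names are mine) of the anisotropic 3D Gaussian that each blob represents. The covariance is factored into a rotation and per-axis scales, which also happens to keep it valid during optimization:

```python
import numpy as np

def gaussian_density(x, mean, R, scales):
    """Density of an anisotropic 3D Gaussian blob at point x.

    Covariance is factored as Sigma = R S S^T R^T (rotation + per-axis
    scales), which conveniently stays positive semi-definite.
    """
    S = np.diag(scales)
    cov = R @ S @ S.T @ R.T
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

# A blob at the origin, stretched along the x-axis:
print(gaussian_density(
    x=np.array([0.5, 0.0, 0.0]),
    mean=np.zeros(3),
    R=np.eye(3),                       # no rotation
    scales=np.array([1.0, 0.2, 0.2]),  # elongated along x
))
```

On top of this density, each blob also carries an opacity and a view-dependent color (usually stored as spherical harmonics coefficients), which is what gives it its “radiance field” flavor.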
But the real magic of Gaussian Splats is that they do not fall under the enraging umbrella of neural networks at runtime, and hence beautifully illustrate how an algorithm that relies on classical methods and mathematics can still deliver great results. Just to clarify, though, it does use backpropagation to optimize the parameters of the Gaussians once they are initialized. So in brief: once you have a sparse reconstruction from the images, you take those 3D points as the initial means of the Gaussians, each of which has its own specific properties. These properties are then optimized via backpropagation.
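As a rough sketch of what that setup might look like (hypothetical names, PyTorch assumed - real implementations such as the official 3DGS code differ in the details):

```python
import torch

def init_gaussians(sparse_points: torch.Tensor) -> torch.nn.ParameterDict:
    """Turn sparse SfM points into a set of learnable Gaussian parameters."""
    n = sparse_points.shape[0]
    return torch.nn.ParameterDict({
        "means":      torch.nn.Parameter(sparse_points.clone()),   # one Gaussian per sparse point
        "log_scales": torch.nn.Parameter(torch.zeros(n, 3)),       # per-axis extent (log space)
        "rotations":  torch.nn.Parameter(
            torch.tensor([[1.0, 0.0, 0.0, 0.0]]).repeat(n, 1)),    # unit quaternions
        "opacities":  torch.nn.Parameter(torch.zeros(n, 1)),       # squashed by a sigmoid at render time
        "colors":     torch.nn.Parameter(torch.rand(n, 3)),        # or SH coefficients for view dependence
    })

gaussians = init_gaussians(torch.rand(1000, 3))  # stand-in for real SfM points
optimizer = torch.optim.Adam(gaussians.parameters(), lr=1e-3)
# training loop (not shown): render the splats, compare against a ground-truth
# photo with an image loss, backpropagate, and step the optimizer
```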
During rendering, the optimized Gaussians are projected (“splatted”) onto the image plane and blended together for every pixel of the H x W image, sorted by depth. But again, it is important to note that this rendering never passes through an MLP or any neural network, unlike NeRFs, which makes Gaussian Splatting considerably faster at render time.
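Here is a toy sketch of that per-pixel blending (my own simplification - the real rasterizer does this for tiles of pixels in parallel on the GPU), just to show there is no network call anywhere in the loop:

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending of the Gaussians covering one pixel.

    colors: (K, 3) colors of the K Gaussians hitting this pixel, sorted near-to-far
    alphas: (K,)   their opacities after projection onto the pixel
    """
    pixel = np.zeros(3)
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color   # weighted sum, no network anywhere
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:                 # early exit once the pixel is opaque
            break
    return pixel
```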
Why make a Mess Mesh?
Okay, so the last section might have felt like blind worship of Gaussian Splatting (and it was, because I am a huge fan of it), but Gaussian Splatting also produces a lot of artifacts in the reconstruction. Think of what happens when some blobs are too elongated: they end up smeared all over the place. Meshes, on the other hand, can be imagined as a cloth wrapping around a skeleton and taking on the structural features of the model in question. This “cloth” is really shorthand for many smaller elements (vertices and faces) connected to each other in regular or irregular patterns. Meshes, although great at representing smooth surfaces and sharp edges, struggle with complex lighting conditions; Gaussian splats, even though they can capture intricate detail and varied lighting, struggle with stray Gaussians, especially at edges and in smooth areas. The following image really drives home the point of a mesh being a cloth stitched from many smaller subcomponents.

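And to see how literal that “stitching” is, here is a minimal sketch of a triangle mesh as plain arrays (the names are mine) - just vertices and the faces that connect them:

```python
import numpy as np

# A unit square "cloth" stitched from two triangles:
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
# each face is a triple of vertex indices
faces = np.array([
    [0, 1, 2],
    [0, 2, 3],
])
# libraries like trimesh or Open3D consume exactly this kind of data
```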
When they borrow the invisibility cloak…
Now, obviously there are many more methods for 3D reconstruction, and still more variants of each, and I could keep talking about them forever, but this blog is about the missing pieces. Regardless of the methodology at play, one should not forget that 3D reconstruction not only derives from but relies on 2D data. If we do not have snapshots of a particular scene or object from certain angles, it becomes very hard to model it. In the case of Gaussian Splats, if we do not have ground-truth captures from certain angles, the mean points for the Gaussians in that region will be missing, so the surrounding Gaussians will try to fill the region in, which means either artifacts in the final reconstruction or a poor splat overall.
Here is a snapshot of a 3D reconstruction of a temple in Orissa - Somnath, which was done using fewer images than required:

[source - Image from Author]
Diffusing it away
Okay, so let us first pin down the problem we have here - we are missing snapshots from certain angles or poses, and hence the view generated from those poses is full of artifacts. What if there was a way to take a snapshot of that artifact-ridden view and also obtain the camera pose along with it? Because remember - for most 3D reconstruction pipelines we don’t just need the images, we also need the camera poses, so we first need a way to obtain the novel camera position. The approach we are gonna follow will be something like:
- Given the training or ground-truth camera positions already present in our dataset, we can interpolate to find camera positions that are not already there.
- From the novel camera poses we obtain, we then need a way to take a snapshot to get the novel view, which might contain artifacts from the surrounding Gaussians.
- Once we have this novel view, we need a way to get rid of these artifacts and obtain something closer to the ground-truth view.
Now I don’t know about you, but in hindsight this feels to me like an elegant idea that sounds almost trivial. It is partly inspired by a few research papers I came across, as well as inputs from my professor and mentors.

[source - Image from Author]
So, having established the idea, let us see how we can achieve these goals. Interpolating and taking snapshots can only happen once we have the model trained as a Gaussian splat. For that part we have Nerfstudio, and using its Python API we can even automate the whole process headlessly.
We can interpolate from the existing camera positions and then rasterize the splat to obtain the snapshot, so that we end up with additional camera positions and their corresponding novel views, which can be used downstream for training. But there’s a catch: the images we obtain this way contain artifacts, so we need a way of fixing them.
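As a sketch of the interpolation step (my own illustration using scipy - Nerfstudio also has its own camera-path tooling for this), positions can be linearly interpolated while orientations are slerped:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(pose_a, pose_b, n_steps):
    """Interpolate camera poses given as (position, quaternion [x, y, z, w]).

    Positions are linearly interpolated, orientations are slerped.
    """
    (pos_a, quat_a), (pos_b, quat_b) = pose_a, pose_b
    slerp = Slerp([0.0, 1.0], Rotation.from_quat([quat_a, quat_b]))
    for t in np.linspace(0.0, 1.0, n_steps):
        position = (1.0 - t) * np.asarray(pos_a) + t * np.asarray(pos_b)
        orientation = slerp(t).as_quat()
        yield position, orientation

# for each interpolated pose, rasterize the trained splat from that viewpoint
# to get the (possibly artifact-ridden) novel view
```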
Controlled Diffusion
I am pretty sure you experimented with the Ghibli art style when the trend was hot, but no, we aren’t gonna talk about that today. We are gonna take up its predecessor - Stable Diffusion. Diffusion models are capable of generating realistic images from pure noise; essentially they learn a mapping that removes noise in just the right amounts to end up with an image. But if you have ever used any of those awesome demos online, you will have noticed that “A happy Koala made out of Blueberry Cake” doesn’t always yield the same image - there are variations in every generation, unless of course we make the model deterministic by fixing the sampler and the seed.
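For instance, with Hugging Face’s diffusers library (treat this as a sketch - the model ID and exact API details vary by version), fixing the seed is what pins the koala down:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "A happy Koala made out of Blueberry Cake"

# without a fixed generator, every call gives a different koala
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
image.save("koala.png")
```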
And by now you must have guessed it: if we plan to use a diffusion model to “fix” the artifacts, or rather “diffuse” them away predictably from our novel views, we will need some way of controlling the process. That is where another class of deep learning models comes in - ControlNets!
ControlNets essentially aim at more informed, or controlled, image generation from diffusion models. Imagine your cat posed really beautifully the other day, and now you’re craving some more of the same 🐈. Of course, mean creatures that they are, your cat refuses to recreate the masterpiece, and on top of that it is surprisingly hard to explain a pose to a generative model in words. That’s where ControlNets come in handy: you can pass the image of your cat, along with derived variants of it like a Canny edge map or a depth map, and “condition” the model to generate a cat in that particular pose.

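In diffusers terms, that conditioning looks roughly like this (again a sketch - the checkpoint names are commonly used public ones, and the cat image paths are placeholders):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# a ControlNet trained on Canny edges, bolted onto Stable Diffusion
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# the edge map of the original photo pins down the pose;
# the prompt decides everything else
edges = load_image("cat_canny_edges.png")
image = pipe("a fluffy cat sitting elegantly", image=edges).images[0]
image.save("cat_again.png")
```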
This lets us condition a diffusion model to “fix” the artifact-ridden image: we pass in the nearest ground-truth image, its depth map (denoting distance from the camera), a confidence map (the confidence score of each Gaussian in the captured snapshot), and the view we need to fix, and voilà, the model should repair it. We are still working on this, but I am super excited about this line of research, because just being able to spin the model around to poses we never captured and fix those “broken” views via “controlled” generation blows my mind. What about you?
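To make the plumbing a bit more concrete - and to be clear, this is not our actual pipeline (which is still work in progress), just an illustration of how such conditioning could be wired into an image-to-image ControlNet setup with diffusers; the file paths are placeholders, and a confidence-map ControlNet, for instance, would have to be trained separately:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

# depth conditioning via a pretrained ControlNet; a confidence-map
# ControlNet would have to be trained from scratch
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

artifact_view = load_image("novel_view_with_artifacts.png")  # rasterized splat snapshot (placeholder path)
depth_map = load_image("novel_view_depth.png")               # depth rendered from the splat (placeholder path)

fixed = pipe(
    "a photorealistic view of the temple",
    image=artifact_view,      # start from the broken view instead of pure noise
    control_image=depth_map,  # geometric conditioning
    strength=0.6,             # how far diffusion may deviate from the input view
).images[0]
fixed.save("novel_view_fixed.png")
```

The `strength` knob is the interesting dial here: too low and the artifacts survive, too high and the diffusion starts hallucinating geometry that was never there.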