Let's Paint! Shall we?

Well, 2025 has been the year of 3DGS, with a lot of research coming out in this field. In particular, it has been the year of transformer-style architectures capable of accelerating one aspect or another of the 3DGS pipeline. Structure from Motion (SfM), which used to take hours via traditional pipelines like COLMAP and AliceVision, has been sped up by a factor of nearly 1000x. When I started writing this blog, it was meant to be a discussion of a rather simple way of removing floaters from 3DGS, but as I delved deeper into the topic of 3DGS itself, I realized there's a lot to it that I never fully understood, so let's make this blog all about that, shall we?

Gauss to the rescue (everywhere)

There are a few very influential people whom I absolutely love and admire, and most of them I discovered through the little detours I keep taking into the field of math. One of these people is, of course, Carl Friedrich Gauss. The admiration has to do with the fact that, for me personally, whenever a field starts going down a whirlwind of complexities and intractable proofs, his work comes to the rescue in one way or another. Let me quickly highlight some examples:

  • He gave us a way of visualizing complex numbers as 2D vectors, making them much less "imaginary".
  • The Normal Distribution - it shows up everywhere, a simple, elegant idea that even the most random selection of data is centred around a mean.
  • When you are banging your head against a wall trying to solve a system of equations - Gaussian Elimination to the rescue.
  • And of course, when you realize that rendering radiance fields via neural networks is eating up too much of your time, just model them using Gaussians - Gaussian Splatting.

Not to discredit the man, but the last one is instrumental to us here. 3D Gaussian Splatting - as the name suggests, it aims to model a scene as a collection of Gaussians; more specifically, a collection of anisotropic Gaussians. The word anisotropic is important there, but let's break down both parts first:

  1. Representation of a 3D scene as Gaussians - First off, what is a Gaussian, actually? The way people talk about it, it's very easy to come away with the intuition that a Gaussian is a "blob" in 3D space - well, kinda, but the blob and all its properties are simply values that we get from the Gaussian function: \(G(\mathbf{x}) = \exp\left(-(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right)\) where $\mu$ is the mean of the distribution - it can be thought of as the center, the region where most of the mass of the blob is concentrated - and $\Sigma$ is the covariance matrix. For starters, covariance tells us how dependent two variables are on each other, and when computed for each pair of variables as in a covariance matrix, we end up knowing how "spread out" the data is.

Now, being anisotropic is what's important here. It basically means the Gaussian is not a sphere with an equal spread of influence and information in all three directions - rather, the spread can vary: the Gaussian is now an ellipsoid, and it can stretch more in one direction than another. This is achieved via the covariance matrix. For an isotropic or spherical Gaussian, the covariance matrix is simply a diagonal matrix with equal values on the diagonal, hence the spread is equal along each of the three axes, and the matrix just reflects a uniform scaling: \(\Sigma = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\) But for an anisotropic Gaussian, the covariance matrix is composed from a rotation matrix and a scaling matrix, i.e. $\Sigma = R S S^T R^T$, where $R$ is a rotation matrix and $S$ is a diagonal matrix whose diagonal elements are the scaling factors along the principal axes (x, y, z) of the ellipsoid.

To understand how this plays out, imagine a scene with a long stick. If you were to model this stick with a Gaussian, you would want the Gaussian elongated along the direction of the stick: the scaling matrix would stretch the Gaussian along one axis, and the rotation matrix would rotate it to align with the stick. This would prove difficult with isotropic Gaussians - we would need a lot of them to model the stick, and even then it would not be modeled as accurately as we would like.
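A minimal sketch of the "stick" example, assuming numpy and made-up scale and rotation values: we build an anisotropic covariance from a rotation $R$ and a diagonal scale $S$, then query the Gaussian function at a point.

```python
import numpy as np

# A hypothetical "stick" Gaussian: long along its local x-axis, thin elsewhere.
s = np.array([2.0, 0.1, 0.1])             # per-axis scales (standard deviations)
S = np.diag(s)

theta = np.pi / 4                          # rotate the stick 45 degrees around z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

# Sigma = R S S^T R^T is symmetric positive semi-definite for any R and S,
# which is why 3DGS optimizes R and S rather than Sigma directly.
Sigma = R @ S @ S.T @ R.T

# Query the Gaussian function (as written in the text) at a point x.
def gaussian(x, mu, Sigma):
    d = x - mu
    return np.exp(-d @ np.linalg.inv(Sigma) @ d)

mu = np.zeros(3)
print(gaussian(mu, mu, Sigma))             # 1.0 at the center, decaying outward
```

Points along the rotated long axis keep a high value much farther from the center than points off-axis, which is exactly the "stretched blob" picture.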

There is one more piece to the puzzle: colors. The Gaussian function we just described can only represent a single quantity: how much a given Gaussian contributes to a given point in space. But we need to represent colors as well. If you think about it, there are a few ways this can be done:

  1. Fixed color assignment, where each Gaussian is assigned a single color - the color of the point in space that it is modeling. As you'd expect, this would be very "brittle": when you rotate the camera or move around the scene, the color of the Gaussian would not change, so view-dependent effects like specular highlights would be lost and the scene would not look realistic.
  2. View-dependent color representation - the color changes based on the angle you are looking at the Gaussian from. This is achieved using "Spherical Harmonics", which model color as a function of the viewing direction - the vector pointing from the camera to the Gaussian. So you ask, "if I am looking from this direction, what color would this part of the Gaussian be?", and the spherical harmonics answer that.

Besides the above attributes, the Gaussians in 3DGS have a few additional attributes, like opacity and scale. Opacity is basically the probability that a ray of light will be absorbed by the Gaussian, and scale is basically the size of the Gaussian. To understand what floaters are, however, we need to understand the optimization process behind 3DGS. The question is: given that you are representing the scene as a collection of Gaussians, how would you go about making them represent the scene as accurately as possible? The most straightforward way is to optimize the parameters of the Gaussians so that the rendered image is as close as possible to the ground-truth image. In other words, what we are looking for is a function that tells us how well a given set of Gaussians represents a given image, and that function should be differentiable with respect to the parameters of the Gaussians. So the process of rendering the image from the Gaussians must itself be differentiable - let's see how that is done.
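To make the "compare rendered to ground truth" idea concrete, here is a toy sketch of the photometric part of that objective. The function name and the tiny 2x2 "images" are mine, not the reference implementation; the actual 3DGS loss combines this L1 term with a D-SSIM term.

```python
import numpy as np

# L1 photometric loss: mean absolute difference between the rendered
# image and the ground-truth image. Differentiable almost everywhere,
# so gradients can flow back to the Gaussian parameters.
def l1_loss(rendered, ground_truth):
    return np.abs(rendered - ground_truth).mean()

rendered     = np.array([[0.2, 0.4], [0.6, 0.8]])   # toy values
ground_truth = np.array([[0.2, 0.5], [0.5, 0.8]])

loss = l1_loss(rendered, ground_truth)
print(loss)  # 0.05
```

In the real pipeline this scalar is backpropagated through the differentiable rasterizer to update every Gaussian's mean, covariance, opacity, and color coefficients.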

As we have seen, each 3D Gaussian is defined as: \(G(\mathbf{x}) = \exp\left(-(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right)\) So think about what steps could possibly be involved in projecting, or rasterizing, this 3D Gaussian onto a 2D image plane. The first, and sometimes not the most obvious, step is to transform the Gaussian from world space to camera space. What are world space and camera space?

World space is where each point has 3D coordinates akin to the actual coordinates of the point in the real world. Sure, we choose a certain point as reference, but the relative positions of the points are the same as in the real world. Camera space is the space where the camera is at the origin and the z-axis points in the direction of the camera's view. This makes the following things easier:

  1. We know that whatever is in front of the camera will have positive z-coordinates, and whatever is behind the camera can easily be ignored during rasterization.
  2. The math of actually projecting the Gaussians becomes much simpler, as we will see.

So essentially, we have a few Gaussians in 3D space, and we want to project them onto a 2D image plane. Remember that even though a "Gaussian" is a blob in 3D space, it is nevertheless a function that we query at every point to estimate how much of that Gaussian is present there, and because we can query the function at any value of x, the Gaussian representation is continuous. So, when projecting the Gaussian onto a 2D plane, instead of transforming every point, can we transform the function itself and obtain a new function that we can query at every 2D point to obtain the rendered image?

So to put it in steps:

  1. Transform the Gaussians from world space to camera space - remember, this transformation makes the camera the origin of the scene, with the depth axis along the direction of the camera's view.
  2. Project the Gaussians onto the 2D image plane. Here we don't transform every sample of the function; we transform only the attributes of the Gaussian onto the 2D plane. Essentially, we turn a query function in 3D into a query function in 2D. Let us look at the projection steps in detail to understand how this process is differentiable, which is what all the fuss is about. The covariance matrix is first transformed from the world view into the camera view: \(\Sigma_L = W \Sigma W^T\) where $W$ is the rotation matrix that transforms world space into camera space. Again, this simply gives us the benefits described above, with depth nicely aligned along one axis. But remember how I said that this transformation also makes a lot of the projection math easier?

Well, projecting a 3D Gaussian into 2D means obtaining an equivalent function in 2D that represents the same Gaussian as seen from the camera, so we transform the mean $\mu$ to $\mu_L = W\mu$ and then project the camera-space covariance $\Sigma_L$ into 2D. The mean of any Gaussian in 3D space is just a point $(x, y, z)$, so it can be projected via the simple projection equations: \(x' = f_x \frac{x}{z} + c_x, \quad y' = f_y \frac{y}{z} + c_y\) where $z$ is the depth of the point, $f_x, f_y$ are the focal lengths of the camera, and $c_x, c_y$ are the principal points. What is essential to note is that projection from 3D to 2D is mostly a scaling operation, i.e. division by $z$, followed by a linear transformation.
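A quick sketch of projecting the mean, with made-up intrinsics ($f_x, f_y, c_x, c_y$) and a toy point; the identity rotation and zero translation stand in for a real camera pose:

```python
import numpy as np

# Hypothetical camera: focal lengths and principal point are illustrative.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
W = np.eye(3)                       # camera aligned with world axes (toy pose)
t = np.array([0.0, 0.0, 0.0])

mu_world = np.array([1.0, 2.0, 4.0])
mu_cam = W @ mu_world + t           # world space -> camera space

x, y, z = mu_cam
u = fx * x / z + cx                 # perspective projection of the mean:
v = fy * y / z + cy                 # divide by depth, then a linear map
print(u, v)                         # 445.0 490.0
```

Note the shape of the operation: a division by $z$, then scale and shift. That non-linear division is exactly what makes the covariance harder to project, as discussed next.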

However, the covariance matrix is a bit more complicated. You cannot just divide the matrix by the depth $z$ and expect to get the 2D covariance matrix, because the covariance matrix describes a lot more than position - it describes the spread of the Gaussian in three directions and how it is oriented in 3D space. And here we pull off a trick, or rather a mathematical superpower. You see, the projection function involves division by $z$, which, looked at globally, is a non-linear operation, and it is hard to apply a non-linear operation to a covariance matrix. If the transformation were linear, we could leverage the following property of matrices: if $\Sigma$ is the covariance matrix of a random variable $X$, then the covariance matrix of $AX$ is $A\Sigma A^T$, where $A$ is a linear transformation applied to $X$. In our case $\Sigma$ is the covariance matrix of the 3D Gaussian, so if we could express the projection operation (division by $z$) as a linear transformation, we could simply apply the above property and obtain the 2D covariance matrix.

And would you know it, this transformation is given by the Jacobian matrix of the projection function. Bear with me; we will see how the Jacobian comes into play. Essentially, the covariance matrix tells me how much the blob spreads in 3D space, and I want to know: if the blob spreads by $\Delta x$ in 3D, what is its spread $\Delta u$ in 2D? From the projection relation above: \(\Delta u \approx \frac{\partial u}{\partial x} \Delta x = \left( \frac{f_x}{z} \right) \Delta x\) Then you ask: what matrix carries the partial derivatives of the projection function? The Jacobian matrix: \(J = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} & \frac{\partial u}{\partial z} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z} \end{bmatrix} = \begin{bmatrix} \frac{f_x}{z} & 0 & -\frac{f_x x}{z^2} \\ 0 & \frac{f_y}{z} & -\frac{f_y y}{z^2} \end{bmatrix}\)

And there is a hidden trick that might go unnoticed if you are not careful. Technically, the perspective division needs to happen for every point, each with its own value of $z$. However, a single Gaussian is small, and inside the very small region it occupies, the curve traced by the depth values $z$ is nearly a straight line. So in the matrix above, \(\frac{\partial u}{\partial x} = \frac{f_x}{z}\) where $z$ is the depth of ONLY the centre of the Gaussian. This means we have found the linear transformation we were looking for: the Jacobian matrix lets us translate the 3D covariance matrix into a 2D covariance matrix, using the amazing linear-algebra property from above: \(\Sigma_{2D} = J \Sigma_{3D} J^T\)
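Putting the last two steps together in a short sketch (toy focal lengths, toy camera-space mean and covariance): build $J$ at the Gaussian's center and push the covariance through it.

```python
import numpy as np

# Illustrative values, not from any real scene.
fx, fy = 500.0, 500.0
mu_cam = np.array([1.0, 2.0, 4.0])           # Gaussian center in camera space
x, y, z = mu_cam

# Jacobian of (u, v) = (fx*x/z + cx, fy*y/z + cy), evaluated ONLY at the
# center's depth z -- the local linearization described in the text.
J = np.array([[fx / z, 0.0,    -fx * x / z**2],
              [0.0,    fy / z, -fy * y / z**2]])

Sigma_cam = np.diag([0.04, 0.04, 0.04])      # a small isotropic blob in 3D
Sigma_2d = J @ Sigma_cam @ J.T               # covariance of a linear map: A Sigma A^T
print(Sigma_2d.shape)                        # (2, 2)
```

The result is a 2x2 covariance, i.e. an ellipse on the image plane, and because every step is matrix multiplication, gradients flow straight back through it.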

  1. Pheww, that was a long one - a subpoint just turned into an entire blog. Anyway, the transformed attributes of the 3D Gaussian splat help us approximate how each 2D point is influenced by the projected splat. But remember that for a single region in 2D space, there will not be just one Gaussian contributing, but a host of Gaussians, so we must also have a way of estimating the order and the contribution of each Gaussian to the final color of that pixel. The way this is done is rather interesting:

Here the algorithm uses a trick called tile-based rasterization. The idea is that we divide the image into tiles of certain dimensions, say 16 x 16 pixels, and then check the contribution of each Gaussian to each tile. This saves us from evaluating every Gaussian for every pixel; instead we only evaluate the Gaussians that land on a given tile. The Gaussians are then sorted by depth, which is easy now thanks to the camera-space transformation we did earlier. Interestingly, this sorting is not done for each tile in isolation; rather, it is done on the GPU via Radix Sort, which works like a charm on GPUs for large sets of numbers. In our case, the numbers are built from the depth values of the Gaussians and the tiles they land on. Each Gaussian is given a "key", which mathematically is: \(Key = (Tile\_ID \ll 32) \ | \ Depth_{int}\)

To clear up the jargon of the binary operations there: we left-shift the tile ID by 32 bits and then perform a bitwise OR with the (integer-encoded) depth of the Gaussian. So imagine a particular tile, say 001 in binary; left-shifting it by 32 bits gives us 001 followed by 32 zeroes, and ORing it with the depth essentially appends the depth of the Gaussian to the tile ID. The GPU takes these keys for all the Gaussians and sorts them all together. I will not go into the details of Radix Sort here, but the key point is that it works extremely fast for large sets of numbers - check out a tutorial here: Radix Sort
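The key construction and the "one global sort does both jobs" effect can be sketched in a few lines. The fixed-point depth encoding here is a toy of my own; the real rasterizer packs depth bits differently.

```python
# Sort key: tile ID in the high bits, quantized depth in the low 32 bits.
def make_key(tile_id: int, depth: float) -> int:
    depth_int = int(depth * 1000) & 0xFFFFFFFF   # toy fixed-point depth encoding
    return (tile_id << 32) | depth_int

# Two Gaussians on tile 1 and one on tile 0, at different depths.
keys = [make_key(1, 2.5), make_key(1, 1.25), make_key(0, 3.0)]
keys.sort()   # one global sort groups by tile, then orders by depth within a tile

print([(k >> 32, k & 0xFFFFFFFF) for k in keys])
# [(0, 3000), (1, 1250), (1, 2500)]
```

Because the tile ID sits in the more significant bits, sorting the combined integers automatically clusters each tile's Gaussians together, already in front-to-back depth order.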

Once the Gaussians are sorted by depth and associated with their tiles, the GPU splats them - or rather, paints the picture - as follows:

  • Each tile is handled by its own group of GPU threads, which take care of the painting for that tile.
  • Painting here just means identifying the final color and opacity of each pixel in that tile due to the collected Gaussians, and it is done via the formula: \(C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)\) Hold up, I will break it down for ya: i. $c_i$ is the color of the i-th Gaussian. ii. $\alpha_i$ is the opacity of the i-th Gaussian. iii. $\prod_{j=1}^{i-1} (1 - \alpha_j)$ is the "transmittance", which defines how much light is left over after passing through all the Gaussians in front of the i-th one.
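The blending formula above can be computed front-to-back with a running transmittance, which is also what enables early termination. The colors and opacities below are toy values:

```python
import numpy as np

# Per-pixel front-to-back alpha blending, Gaussians sorted nearest first.
colors = np.array([[1.0, 0.0, 0.0],      # red, nearest
                   [0.0, 1.0, 0.0],      # green
                   [0.0, 0.0, 1.0]])     # blue, farthest
alphas = np.array([0.5, 0.5, 0.5])

C = np.zeros(3)
T = 1.0                                  # transmittance: light not yet absorbed
for c, a in zip(colors, alphas):
    C += c * a * T                       # the c_i * alpha_i * prod(1 - alpha_j) term
    T *= (1.0 - a)                       # attenuate for everything behind
    if T < 1e-4:                         # early termination, as on the GPU
        break

print(C)  # [0.5   0.25  0.125]
```

Each nearer Gaussian eats half the remaining light, so contributions fall off geometrically: 0.5, 0.25, 0.125. After enough opaque splats, $T$ hits (near) zero and everything behind is skipped.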

If you think about it, all of the above steps save a lot of time and computation. First, each tile is processed in parallel by its own threads, so the process is extremely parallelized. Second, when calculating transmittance, there will be a certain number of Gaussians after which the transmittance nears zero - i.e. no light passes through - so the GPU does not even need to process the remaining Gaussians. This is unlike NeRFs, where the MLP is queried for every point in space, even empty space, every time you want to render the scene from a particular viewpoint.

That's it - that is essentially the entirety of the "splatting" pipeline. What remains is comparing the rasterized image with the ground-truth image and calculating the loss, which can then be used to update the parameters of the Gaussians via backpropagation. Hmm... I actually wanted to discuss floaters in this post, but I guess we will need to reserve that for a new one >_< !!



