Breaking down SREGym!

So our initial PR (which closely mimicked the Reddit Pi Day Outage) was not novel enough, partly because I had missed a few recently created issues and partly because I ended up framing the Pi Day outage itself in the wrong way. But we are back!! With another issue. I had to dig in a few places for this one, because the rate at which the candidates are proposing issues and patching them up has become breathtaking. Anyways, let’s get to it:

So the problem stems from an issue that was opened in the official Kubernetes repository - #41896 - and which was subsequently patched in PR #117927. The crux of the issue is this:

Pods are the smallest deployable unit in Kubernetes that can run an application, as we have already seen in the previous post. A pod may contain one or more containers, which it manages. When a pod starts, its configuration specifies the container images it requires, and the kubelet on the node pulls those images to get the application running. And as you would know, container images can be either public or private. Public images do not need any authentication to be pulled, while private images do - the term for the authentication credentials is “image pull secrets”. Issue #41896 pointed out that in earlier versions of Kubernetes, if any of these image pull secrets were missing, the pod would fail to start BUT WITHOUT ANY WARNING or LOGS. And this is exactly what was patched in #117927.
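To make the relationship concrete, here is a minimal sketch of how a pod refers to its image pull secrets and how one could check whether any referenced secret is absent from the namespace. The pod name, namespace, image and secret names are all hypothetical placeholders, and `missing_pull_secrets` is a helper made up for illustration, not something SREGym or Kubernetes ships.

```python
from typing import Dict, List, Set


def missing_pull_secrets(pod: Dict, existing_secrets: Set[str]) -> List[str]:
    """Return the names under spec.imagePullSecrets that do not exist in the namespace."""
    referenced = [ref["name"] for ref in pod.get("spec", {}).get("imagePullSecrets", [])]
    return [name for name in referenced if name not in existing_secrets]


# A pod manifest in the same shape as the Kubernetes API object.
# Every name below is a hypothetical placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "frontend", "namespace": "demo"},
    "spec": {
        "containers": [
            {"name": "frontend", "image": "registry.example.local:5000/example/app:latest"}
        ],
        "imagePullSecrets": [{"name": "private-registry-cred"}],
    },
}

# Secret names that currently exist in the namespace (also hypothetical).
print(missing_pull_secrets(pod, {"some-other-secret"}))
```

If the secret named in `imagePullSecrets` is not in the namespace, the helper reports it, which is the situation the issue describes.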

So the challenge for the agent would be to investigate the cluster, figure out that the pod is failing to start because of missing image pull secrets and is emitting ImagePullBackOff errors, check the specific reason for those ImagePullBackOff errors (which would be missing credentials), find those missing credentials, and then patch the broken pods so that the cluster is safely back up and working. But of course there are many more intricate details to this, because we must simulate the entire thing in the SREGym environment for the agent to work in. This time we went on to tackle a pretty interesting challenge - simulating the problem for both local kind-based clusters and cloud-based AWS instances. So let’s get into it:
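As a rough illustration of the first diagnostic step, the sketch below scans a pod list in the shape returned by `kubectl get pods -o json` and reports containers stuck in an image pull failure. The function name and the sample data are hypothetical; only the `status.containerStatuses[].state.waiting.reason` path and the `ErrImagePull` / `ImagePullBackOff` reasons come from Kubernetes itself.

```python
from typing import Dict, List, Tuple

# Waiting reasons the kubelet reports when an image cannot be pulled.
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def find_image_pull_failures(pod_list: Dict) -> List[Tuple[str, str, str]]:
    """Return (pod, container, reason) for every container stuck on an image pull."""
    failures = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # "waiting" is only present while the container has not started.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_FAILURE_REASONS:
                failures.append((pod_name, status["name"], waiting["reason"]))
    return failures


# Hypothetical sample, trimmed to the fields the function reads.
sample = {
    "items": [
        {
            "metadata": {"name": "frontend-abc"},
            "status": {
                "containerStatuses": [
                    {"name": "frontend", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
                ]
            },
        },
        {
            "metadata": {"name": "search-xyz"},
            "status": {"containerStatuses": [{"name": "search", "state": {"running": {}}}]},
        },
    ]
}

print(find_image_pull_failures(sample))
```

From there the agent would still need to look at the pod's events and spec to learn that the cause is a missing secret rather than, say, a wrong image tag.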

The initial setup

So to list out the things we must get done:

  1. Launch the hotel-reservation app and wait for it to be up.
  2. By default, the application pulls public images from Docker Hub and doesn’t need a secret or an auth key.
  3. Sooo, we must take the public image and then somehow make it private!!! There is an obvious way to do this - just pull from an existing private image, but that comes with the headache of having to maintain the secret and keep track of auth changes.
  4. Since we cannot just switch to an existing private image, we instead pull the public image and push it to a private docker registry that we create in our self-hosted cluster, and assign that registry an auth key. Clever, eh?
  5. The entire inject_fault() pipeline involves deleting the image pull secret, copying the public image into the private registry, and then pointing the pod at the private registry instead of the public one.
  6. A key caveat here is that the mitigation for this broken image pull secret involves reconstructing the secret, but how does an agent reconstruct a secret it has never seen? Random guessing won’t cut it, I think. So we also leave breadcrumbs pointing at the registry credentials, with enough information for the agent to find them and reconstruct the secret.
  7. Once the secret has been added to the namespace, the agent must force the pod to restart so that it pulls the image from the private docker registry. The most straightforward way of doing this is to just delete the pod; when it is recreated, it is forced to pull the image again with the updated image pull secret.
  8. A tangent that we will take later in this article is setting up this cluster on AWS - which, trust me, was a pain.
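Two pieces of the steps above can be sketched in a few lines: rewriting an image reference so it points at the private registry, and rebuilding the image pull secret once the credentials are known. This is a minimal sketch under stated assumptions, not the actual SREGym code. The function names, registry address, secret name, namespace and credentials are all hypothetical; what is real is the Secret type `kubernetes.io/dockerconfigjson` and the `.dockerconfigjson` payload layout.

```python
import base64
import json
from typing import Dict


def retarget_image(image: str, registry: str) -> str:
    """Point an image reference at another registry, keeping repository path and tag."""
    first, sep, rest = image.partition("/")
    # A leading component is treated as a registry host if it looks like one.
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    return f"{registry}/{rest if has_host else image}"


def build_pull_secret(name: str, namespace: str, registry: str,
                      username: str, password: str) -> Dict:
    """Build a Secret manifest of type kubernetes.io/dockerconfigjson."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {registry: {"username": username, "password": password, "auth": auth}}
    }
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# Hypothetical values, standing in for whatever the breadcrumbs reveal.
REGISTRY = "registry.example.local:5000"
secret = build_pull_secret("private-registry-cred", "demo", REGISTRY, "sre", "s3cret")

print(retarget_image("example/app:latest", REGISTRY))
print(json.dumps(secret, indent=2))
```

The printed manifest could then be applied to the namespace, after which deleting the failing pod makes its replacement retry the pull with the restored secret.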

Defining the problem and simulating it in SREGym



