Breaking down SREGym!

So a couple of days ago, I got an email from a professor at UIUC asking if I would be interested in working with him on a project related to SRE evaluation. Well, it was sort of an evaluation task for a bunch of candidates including me, which we had to pass if we were to work under him. This is my breakdown of it:

Background

SRE, or Site Reliability Engineering, is a discipline that aims at reliably detecting points of failure in a system - be it software or hardware - and then diagnosing and mitigating them. Think of it like a very methodical way of carrying out a bugfix in large-scale systems.

Now of course in development, you can’t just target real systems running off-the-shelf programs, so we need to create “simulated setups”, or clusters so to say. In order to do that, we use - yes, you guessed it - Kubernetes. Kubernetes is what software engineers use to deploy, scale and manage their applications in a distributed manner, which can either be on the cloud or on your own machines. The key things to keep in mind here are the “nodes” and the “pods”.

Nodes are the machines (virtual or physical) that run your application. Pods can be thought of as the smallest self-sustaining unit of Kubernetes that can run an application. To make things easier, you can even connect this to the life-pod that is so dear to sci-fi writers: each pod is supposed to have all the required resources to run a particular application, which means storage, network resources etc. And everything inside a pod is designed to run together as one logical unit.

Example is better?

Okay, so instead of absolutely boring the hell out of you by talking about all the theory here, I had better illustrate this using an example from the SRE benchmark tool itself. Let’s take a sample scenario which requires an SRE to save the day: the Hotel Reservation System.

Okay, let’s be a bit more specific - Service Port Conflict in the Hotel Reservation System. But as promised, I will break down everything. So in a design such as the Hotel Reservation System, a service is essentially a component of the entire system that performs a specific function. So for instance, “Booking Service” handles the booking of rooms, “Luggage Service” handles any additional “baggage” (not emotional, dw) that you might be bringing, “Payment Service” handles the financial aspects etc. Now in order for the system to work as desired, each service must be able to listen for API calls and requests from other services and respond to them. And that is where the concept of “ports” comes in.

You see, just like we humans don’t want strangers to throw letters in through our bedroom or dining room window, but to drop them in a mailbox, services also only listen on specific addresses called “ports”. So for instance, “Booking Service” might be listening on port 8080, “Luggage Service” on port 8081 and so on. Sooo, as the title would have made click by now, a service port conflict is essentially the problem where two or more services are assigned the same port on the same machine, and just like how my neighbor can’t use my mailbox, two services can’t share that port. That’s it, that is the problem.
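
To make the mailbox analogy concrete, here is a minimal sketch of spotting a port conflict. The service names and port numbers are made up for illustration (they are not taken from SREGym), and the check simply assumes all services share one machine.

```python
from collections import defaultdict

def find_port_conflicts(service_ports):
    """Return {port: [services...]} for every port claimed by more than one service."""
    claimed = defaultdict(list)
    for service, port in service_ports.items():
        claimed[port].append(service)
    return {port: sorted(names) for port, names in claimed.items() if len(names) > 1}

# Hypothetical assignments: "payment" was misconfigured to reuse booking's port.
services = {"booking": 8080, "luggage": 8081, "payment": 8080}
print(find_port_conflicts(services))
```

Any port that shows up in the result has more than one owner, which is exactly the situation the injected fault creates.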

Stepping through SREGym

Accessing the CLI: First we will drop into the CLI of SREGym. Oh, and if you want to follow along, you can go check out their README for setup instructions; this article is more of a deep dive into ONLY the technical side of things.

So we start the CLI version of SREGym, which allows us to directly start processes, analyse the issues, check whether our assessment is correct and then mitigate them.
Starting the Task: When we start the hotel_reservation task here, here’s exactly what happens:
- When you start a simulated problem, it first sees if there is anything leftover from a previous run, like maybe stray ghost processes from any previous problem ID that might have run, or containers etc. It cleans it up.
- Then comes the life-pod role: it starts the necessary services. Prometheus is essentially the tool that collects the various metrics and system information. Jaeger tracks how requests are flowing through the system (requests are essentially the messages travelling from one place to another). Loki is for log aggregation, which is different from what Prometheus gives us.
- Prometheus is specifically meant for gathering metrics, while Loki is meant to gather the logs. So Prometheus will give you surface-level information - like a metric being too low or the CPU usage being high - but if you want to check what may have caused the issue or how it happened, we gotta go for the logs collected in Loki.
- Then we have OpenEBS, which is a storage system used to provide storage to the pods and, more importantly, “persistent storage”, i.e. storage that remains even if the pods are moved or deleted.
Diagnosing the Issue: Alright, so that was all that happens when starting the cluster, or in other words, when you are simulating the bug. Let’s talk about the steps to actually figure out what the issue is and how to tackle it. First up, we inspect which pods/services are up to see what is failing, so we need to list the various pods related to hotel reservation that the task has started and check their status. Out of all the pods that get listed, one in particular - recommendation-677bbf995f-r8fq - is still “Pending”, i.e. it has not been able to start - probably because of the port conflict we talked about earlier.
Investigating the Failure: Okay, so once we have the system up and running, how do we actually go about diagnosing what is wrong and fixing it? You check the logs to root out the reason why the thing has not started. Now the interesting thing about this problem is that you will see nothing in the logs - which is what happens with many systems in the real world - they fail silently. And as a side-tip, always keep in mind that a pending pod in Kubernetes will have no logs because it never ran.

So first we do a scan of all the pods as stated above, and if something is pending, we then ask the cluster “WHY” (kubectl describe pod on the pending pod shows the events that explain it). The scan looks like this:
```
kubectl get pods -A
```
This gives us a list of all the pods across all namespaces (that is what -A does). Now very intuitively, as we said, the pods are essentially individual components of the system, each running its own bits and pieces. And this very first command gives us a pretty good idea of what could be going wrong: we get a STATUS for every pod, telling us which ones are running and which ones are not.

Now once we get the status of each pod that exists, we know which one is the culprit: anything that is not Running (Pending, failed etc.) denotes one failure or another in the system. In this case, our output looks something like this:
```
NAME                                            READY   STATUS    RESTARTS   AGE
hotel-reservation-frontend-579c99867f-2584x   1/1     Running   0          19m
hotel-reservation-frontend-579c99867f-57j5j   1/1     Running   0          19m
hotel-reservation-frontend-579c99867f-6568x   0/1     Pending   0          19m
hotel-reservation-frontend-579c99867f-8699d   1/1     Running   0          19m
hotel-reservation-frontend-579c99867f-8z56z   1/1     Running   0          19m
hotel-reservation-frontend-579c99867f-9j57p   1/1     Running   0          19m
```
You’ll see one of the pods is still pending and has not really started. Interestingly, in SREGym, if you are prudent enough to check the startup logs, the problem with this pod can be seen way earlier since it’s a manually introduced one - you will see the system saying that it injected a service port conflict. However, of course, we are not cheating here: we need to wire the agent to properly identify and then mitigate the issue.
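
Since the end goal is an agent doing this check instead of our eyeballs, here is a rough sketch of how the table above could be filtered programmatically. It only parses text in the shape shown above; the function name is my own, and a real agent would more likely ask kubectl for structured output.

```python
def pods_not_running(table_text):
    """Return (name, status) for every row whose STATUS column is not 'Running'."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    suspects = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_idx] != "Running":
            suspects.append((cols[name_idx], cols[status_idx]))
    return suspects

# A few rows copied from the output above.
sample = """
NAME                                            READY   STATUS    RESTARTS   AGE
hotel-reservation-frontend-579c99867f-57j5j   1/1     Running   0          19m
hotel-reservation-frontend-579c99867f-6568x   0/1     Pending   0          19m
hotel-reservation-frontend-579c99867f-8699d   1/1     Running   0          19m
"""
print(pods_not_running(sample))
```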

The Oracle

An oracle is the component that checks whether the solution given by the agent is correct or not. As such, there are different kinds of oracles:

As we have seen above, there are multiple stages to the problem: starting from identification (a problem exists), to diagnosis (why it exists), to mitigation (fixing it).
There are oracles for each stage, out of which the diagnosis oracle is an LLM. We cannot expect the agent or a human to tell us why something is failing in the same wording every time (i.e. if the answer is “The port is stuck at 9100”, one agent may give an elaborate answer, another may give a short one).
So an LLM can better judge whether the given answer is close to the actual one or not.
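
To see why a strict string comparison would not cut it, here is a toy comparison. The keyword check below is only my stand-in for "a more lenient judge"; it is NOT how SREGym's LLM oracle works, and the answers are invented for illustration.

```python
import re

def exact_match(answer, expected):
    """Strict oracle: the wording must be identical (ignoring case and edges)."""
    return answer.strip().lower() == expected.strip().lower()

def keyword_match(answer, required_keywords):
    """Lenient stand-in oracle: every required keyword must appear as a word."""
    words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    return all(keyword.lower() in words for keyword in required_keywords)

expected = "The port is stuck at 9100"
short_answer = "port 9100 is stuck"
long_answer = "The pod cannot start because the port it needs, 9100, is stuck and already in use."
wrong_answer = "The disk is full"

for answer in (short_answer, long_answer, wrong_answer):
    print(exact_match(answer, expected), keyword_match(answer, ["port", "9100"]))
```

The strict check rejects both correct answers, while even this crude lenient check accepts them. An LLM judge goes one step further by also handling paraphrases that share no keywords at all.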

Key Terminologies

DaemonSets: Controllers that run in the background and ensure that a copy of a pod runs on each node in the cluster. For instance, say we have a pod that contains a system for monitoring logs - now of course monitoring is required on each node of the cluster, and as such we would need a copy of that pod running on each node. This can be easily managed by a DaemonSet. But the downside here is that because its job is to ensure that there is a pod on every node, in scenarios where we create more and more nodes, the DaemonSet keeps launching an additional pod for each new node, which can lead to resource exhaustion such as OOM errors.
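
A tiny sketch of the "one pod per node" rule, with made-up node and pod names. It only models the counting, not a real controller.

```python
def daemonset_pods(nodes, name="log-monitor"):
    """Model of a DaemonSet: exactly one pod for every node in the cluster."""
    return {node: f"{name}-{node}" for node in nodes}

small_cluster = ["node-1", "node-2", "node-3"]
grown_cluster = small_cluster + ["node-4", "node-5"]
print(len(daemonset_pods(small_cluster)), len(daemonset_pods(grown_cluster)))
```

The pod count tracks the node count one-to-one, so every node you add also adds the DaemonSet pod's footprint.
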
Docker Containers - ECR: Elastic Container Registry (ECR) is a managed container registry service provided by AWS. It is used to store, manage, and deploy Docker container images. In large companies and fast-paced environments, with every new build/test/added functionality, a new Docker image gets built, so they might end up with, say, 10,000 or more Docker images every day. Keep in mind that each Docker image is mostly just a newer version of the previous one - so you don’t actually need all of them and must keep cleaning them up in order to save space and costs. One way to manage Docker images is via Amazon Elastic Container Registry, which keeps track of your images and manages their storage, use and deployment. But ECR has very limited rules for how the images are managed.
Where’s my custom resource: Custom Resources in Kubernetes are extensions of the Kubernetes API which allow you to handle newer tasks better - basically by defining your own kind of object. So for instance, to manage a database, instead of using the built-in resource types, you can define your own resource for it and manage it your own way. And the advantage here is that you get all of the kubectl commands that you would otherwise have if you went with a built-in option. Now this all seems great. So the way the team in this story went about it was that whenever they wanted to update a custom resource, they would change the status of the custom resource to Pending via kubectl (something like kubectl set status = Pending), which would prompt the controller to have a look at the custom resource in an attempt to resolve the pending status. All was good and great until Kubernetes switched versions. Post v1.12, Kubernetes no longer allowed users to set fields of custom resources via kubectl. So now, on an update, a custom resource would never go into the Pending state, which meant that the controller would never look at it - and the update would not go through.
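
Here is a toy model of that failure. The `status_writable` flag is my own simplification of "before vs. after the upgrade", and the resource layout is invented; it only shows why a controller that waits for Pending never wakes up.

```python
def update_resource(stored, submitted, status_writable):
    """Apply an update; the status change is kept only if the API still allows it."""
    updated = dict(stored)
    updated["spec"] = submitted["spec"]
    if status_writable:
        updated["status"] = submitted["status"]
    return updated

def controller_should_reconcile(resource):
    """The controller in the story only reacts to resources marked Pending."""
    return resource["status"] == "Pending"

stored = {"spec": {"version": 1}, "status": "Ready"}
submitted = {"spec": {"version": 2}, "status": "Pending"}

before_upgrade = update_resource(stored, submitted, status_writable=True)
after_upgrade = update_resource(stored, submitted, status_writable=False)
print(controller_should_reconcile(before_upgrade), controller_should_reconcile(after_upgrade))
```
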
AutoScaling Catastrophe: The autoscaler adds/removes pods depending upon the resource utilization. So for instance, if the CPU utilization of the pods goes above a certain threshold, the autoscaler will add more pods to handle the load. But this gets trickier in cases where a pod WILL use up more CPU - but only during startup. Basically, no matter what pod you start for the service, it is bound to use up 100 percent of its CPU as part of a warmup phase, after which it settles down. Now what ends up happening here is that the autoscaler notices that the pod is under heavy CPU load, so it increases the number of pods - and those pods start at 100% too - which starts a cycle of the autoscaler adding more and more pods, eventually leading to an OOM error.
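
The feedback loop is easier to see in a small simulation. This is a sketch under my own assumptions: the scaling rule is the ratio formula from the Kubernetes HPA docs, ceil(current replicas x current utilization / target utilization), while the warmup length, the settled utilization and the tick model are invented, and any safeguards a real autoscaler has for freshly started pods are ignored.

```python
import math

def desired_replicas(current, avg_util, target_util):
    """Ratio-based scaling rule: ceil(current * avg_util / target_util)."""
    return math.ceil(current * avg_util / target_util)

def simulate(ticks, target_util=50, warmup_ticks=3, settled_util=10, max_replicas=64):
    """Track the replica count when every new pod burns 100% CPU while warming up."""
    ages = [0]  # age (in ticks) of each running pod
    history = [len(ages)]
    for _ in range(ticks):
        utils = [100 if age < warmup_ticks else settled_util for age in ages]
        avg_util = sum(utils) / len(utils)
        want = min(max_replicas, max(1, desired_replicas(len(ages), avg_util, target_util)))
        ages = [age + 1 for age in ages]
        if want > len(ages):
            ages += [0] * (want - len(ages))  # new pods start their warmup
        else:
            ages = ages[:want]  # scale down, keeping the oldest pods
        history.append(len(ages))
    return history

print(simulate(4))                  # warmup outlasts the scaling interval
print(simulate(4, warmup_ticks=0))  # no warmup spike at all
```

With a warmup that outlasts the scaling interval, the replica count keeps multiplying even though the actual load never changed; without the spike it stays flat.
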
Health Checks: So Kubernetes works this way: as we have said, there are individual deployable units (pods, ofc), and each pod has one or more containers inside it, and the containers are what actually run the various services. Now there is something called Health Checks in Kubernetes, which uses various probes to check on the health of what you have deployed. And these health checks are per-container. A couple of sample probes are the readiness probe and the liveness probe. The readiness probe checks the status of the container, and if it has failed, it simply says: do not send traffic to this container. The liveness probe checks the status of the container, and if it has failed, the container simply gets restarted. But now, the thing is that the health of a “pod” depends upon the health of each individual container - so if even one container fails, the whole pod is considered failed. And this might not be desired when the failed container is not of high importance, something like sshd-agent.
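
A minimal sketch of that "all or nothing" rule. The container names are illustrative, and the second function is a hypothetical policy to show what we might want instead, not a Kubernetes feature.

```python
def pod_ready(container_ready):
    """The pod counts as ready only if every one of its containers is ready."""
    return all(container_ready.values())

def pod_ready_ignoring(container_ready, non_critical):
    """Hypothetical policy: skip containers we have declared non-critical."""
    return all(ready for name, ready in container_ready.items() if name not in non_critical)

containers = {"app": True, "sshd-agent": False}
print(pod_ready(containers), pod_ready_ignoring(containers, {"sshd-agent"}))
```
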
AZ Balancing: The Kubernetes scheduler is not very good at AZ balancing for deployments of services. AZ (Availability Zone) balancing means that the pods are distributed evenly across the different availability zones of a cluster. Now the Kubernetes scheduler may sometimes end up distributing pods unevenly across availability zones, which can lead to some nodes being overloaded while others are underutilized. However, the good thing here is that unlike the other issues, where you may have to patch Kubernetes itself and run your patched version, here you can simply replace the scheduler with another image, since the scheduler itself runs from a Docker image.
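
One simple way to put a number on "unevenly" is the gap between the fullest and the emptiest zone. The zone names below are made up, and this skew measure is just my choice for the sketch.

```python
from collections import Counter

def zone_skew(pod_zones, zones):
    """Difference in pod count between the fullest and the emptiest zone."""
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)

zones = ["az-a", "az-b", "az-c"]
lopsided = ["az-a", "az-a", "az-a", "az-b"]
balanced = ["az-a", "az-b", "az-c"]
print(zone_skew(lopsided, zones), zone_skew(balanced, zones))
```
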
Calico: An open source networking and network-security solution for containers. It acts as the engine that allows the pods in a Kubernetes cluster to talk to each other.
BGP (Border Gateway Protocol): A routing protocol used to exchange routing information across networks. In the context of Calico, BGP is used to advertise pod routes between nodes so that they can reach each other directly.
IP-in-IP Mode: Encapsulates pod traffic inside standard IP packets, which allows routing to happen even if the underlying network doesn’t support BGP routes.

Choosing the problem to target

Now, in the Reddit Pi-Day outage, what had happened was that a Kubernetes version upgrade resulted in a label being removed from nodes - Kubernetes uses node labels to determine where to schedule certain workloads. And in the case of Reddit, the issue was embarrassingly simple: Calico was configured to run route reflectors on nodes with a particular label (master), and the Kubernetes upgrade removed that label, due to which no node could be identified to act as a route reflector, which meant that none of the other nodes got any routing information and their pods were effectively disconnected. For the uninitiated, route reflectors are essentially nodes that relay routing information to the other nodes so that traffic can find its way between them, and I think you would agree that for a social media site like Reddit, the thousands of nodes running various features must be able to communicate with each other. In the absence of the route reflectors, this communication broke down.
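
The heart of the outage fits in a few lines. This is a simplified model with invented node names and a plain "master" label key taken from the description above; it is not Calico's actual configuration.

```python
def select_nodes(nodes, label_key):
    """Return the names of all nodes that carry the given label."""
    return [name for name, labels in nodes.items() if label_key in labels]

before_upgrade = {"cp-1": {"master": ""}, "worker-1": {}, "worker-2": {}}
# The upgrade strips the label; nothing else about the nodes changes.
after_upgrade = {name: {k: v for k, v in labels.items() if k != "master"}
                 for name, labels in before_upgrade.items()}

print(select_nodes(before_upgrade, "master"), select_nodes(after_upgrade, "master"))
```

Every node is still up after the upgrade, but the selector comes back empty, so there is nobody left to play route reflector.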

And for the first issue, I am thinking we can do something similar as an SREGym evaluation task. My plan specifically is this: Kubernetes clusters can use an HPA - a Horizontal Pod Autoscaler - which basically increases or decreases the number of pods depending upon the load. So imagine you have a pod running and it is using up about 80 percent of the CPU; the HPA would increase the number of pods to ensure the load gets distributed. Now, we can pick an app that already exists in the SREGym benchmark and simulate a situation where the HPA’s reference to the workload it is supposed to scale breaks, so it is unable to scale up the number of pods anymore. All the pods run normally here, but any sort of increase in load would lead to the service failing, because the HPA is unable to do its job. Maybe not very novel, but it’s a start.
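
As a rough sketch of what the injected fault would look like: below, the autoscaler finds its workload by name, which is my assumption for this sketch, and all the field names and numbers are hypothetical. The scaling rule is the same ratio formula as in the Kubernetes HPA docs.

```python
import math

def reconcile(hpa, deployments, cpu_util):
    """Scale the referenced deployment; do nothing if the reference is dangling."""
    target = deployments.get(hpa["target"])
    if target is None:
        return None  # broken reference: the HPA has nothing to scale
    desired = math.ceil(target["replicas"] * cpu_util / hpa["target_util"])
    target["replicas"] = max(hpa["min"], min(hpa["max"], desired))
    return target["replicas"]

healthy_hpa = {"target": "frontend", "target_util": 50, "min": 1, "max": 10}
broken_hpa = {"target": "frontend-old", "target_util": 50, "min": 1, "max": 10}

deployments = {"frontend": {"replicas": 2}}
print(reconcile(broken_hpa, deployments, cpu_util=80), deployments["frontend"]["replicas"])
print(reconcile(healthy_hpa, deployments, cpu_util=80), deployments["frontend"]["replicas"])
```

With the broken reference the replica count never moves, even at 80 percent CPU, which is the silent failure the task would ask the agent to find and fix.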