Create Custom Scheduler on GKE for Pod Spreading

Alfred Tommy · Published in Searce · Feb 7, 2020 · 7 min read


Let’s start off with a question that has perpetuated throughout humanity, and conceived the very concept of logic: Why?
Or more specifically (and less dramatically), why do we need this solution?

Say you’re given the task of deploying a production-level stateless application on Kubernetes (for this article, it’s GKE) with two very familiar criteria:

  1. Maximum uptime
  2. Minimum costs

How would you go about achieving this? Well, to minimize costs, I would use preemptible VMs (PVMs). PVMs are substantially cheaper than regular VMs, but with a few caveats: you wouldn’t want to go all in on them, since they can be taken away from you with 30 seconds’ notice. So for a production-grade yet cost-efficient deployment, you generally use a mix of on-demand and PVM nodes and incorporate something like a PVM-killer to avoid mass service disruption. As an aside, check out this article for more details on cutting costs on GKE.

So far so good, but let me show you an example of why we may need a custom scheduler. Say we have an application running on a 3-node cluster (a single preemptible node pool).

Let’s pretend we have an imaginary on-demand node pool where we keep a single replica of each pod as a backup, but it is already strained with heavy resource utilization, and dumping additional pods on it would lead to a drastic reduction in application response time. Relying on that single replica during heavy traffic is a last-resort measure, and one we should try to avoid.

A quick look at my pods shows me this:

kubectl get pods -o wide

On closer inspection, I notice that all of the pods belonging to my productpage micro-service are assigned to the same node!

This means that if that node were to be preempted or terminated, my productpage micro-service would be entirely down or severely impacted, as all users would get routed to that poor stand-alone replica in the on-demand node pool.

The default Kubernetes scheduler prioritizes balanced node utilization over almost everything else, and this is why we need a custom scheduler that prioritizes spreading pods across nodes, so that even if a node gets preempted, at least one replica of each micro-service remains available on another node.

Now that we’ve answered the WHY, let’s get to the HOW.

STEP 1

The scheduler will run as a pod in the kube-system namespace, and we need to ensure that the scheduler pod only runs in the on-demand node pool, so that the scheduler itself can’t be taken out by a preemption.

So first, let’s package the scheduler binary into a container image.
Clone the Kubernetes source code from GitHub and build it.

git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
# check out the release branch matching your cluster's Kubernetes version (v1.13 here)
git checkout release-1.13
make

Note: make sure you check out the branch that matches the Kubernetes version you are running, otherwise the build will fail. In this case, we are using v1.13.

STEP 2

Now we’re going to create the container image that contains the kube-scheduler binary we just built in the previous step. Use this Dockerfile to build the image.

FROM busybox
ADD ./_output/local/bin/linux/amd64/kube-scheduler /usr/local/bin/kube-scheduler

Save this in a file called ‘Dockerfile’, then build the image and push it to GCR (Google Container Registry):

docker build -t gcr.io/<project-id>/my-kube-scheduler:1.0 .
gcloud docker -- push gcr.io/<project-id>/my-kube-scheduler:1.0

where <project-id> is the unique ID assigned to your particular GCP project.

STEP 3

Great, we have our scheduler image in our registry. As of now, though, it is still just the stock kube-scheduler; the custom behavior comes from the policy we attach in the next steps.

Let’s go ahead and create the Kubernetes resources (a Service Account, a Cluster Role Binding, and of course, the Deployment) required for the deployment of the scheduler.
Put the following in a file called “my-scheduler.yaml”.
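(The original post embedded this manifest as a gist, which isn’t reproduced here. Below is a minimal sketch of what my-scheduler.yaml might look like, loosely based on the upstream multi-scheduler example: a Service Account, a Cluster Role Binding to system:kube-scheduler, and the Deployment itself. The labels, resource requests, and exact flag names are assumptions on my part and vary between Kubernetes versions, so cross-check them against your cluster; the image and the two policy flags match those discussed below.)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-scheduler
  namespace: kube-system
  labels:
    component: scheduler
    tier: control-plane
spec:
  replicas: 1
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
    spec:
      serviceAccountName: my-scheduler
      containers:
      - name: kube-second-scheduler
        image: gcr.io/<project-id>/my-kube-scheduler:1.0
        command:
        - /usr/local/bin/kube-scheduler
        # flag names are version-dependent; --scheduler-name and the policy
        # flags below are deprecated/removed in newer releases
        - --address=0.0.0.0
        - --leader-elect=false
        - --scheduler-name=my-scheduler
        - --policy-configmap=scheduler-policy-config
        - --use-legacy-policy-config=false
        resources:
          requests:
            cpu: '0.1'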

Now run the command below and you’ll see a my-scheduler pod deployed in the kube-system namespace. But it won’t be functioning just yet.

kubectl apply -f my-scheduler.yaml

--use-legacy-policy-config=false :- tells the scheduler not to expect a legacy policy file, so that it reads its policy (with our own custom priorities for scheduling) from a configmap instead

--policy-configmap=scheduler-policy-config :- the name of the configmap that describes the characteristics of our scheduler. We will create this configmap in the next step, so remember the name you type here.

image: gcr.io/<project-id>/my-kube-scheduler:1.0 :- replace this value with the scheduler container image you tagged and pushed to GCR

Oh, I almost forgot: do remember to add this chunk to the pod spec in the above deployment yaml. (I’ve kept it separate from the yaml file because you may want to use it for other purposes as well.)

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist

This will ensure that the scheduler pod will never be deployed on a preemptible node.

STEP 4

The next step is to create the configmap. We are going to write a policy file and then create a configmap from it.
Create a file called policy.cfg that looks like this:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "metadata" : {
    "name": "scheduler-policy-config",
    "namespace": "kube-system"
  },
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "PodToleratesNodeTaints"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "SelectorSpreadPriority", "weight" : 10},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}

Then run this command to convert the policy file into a Kubernetes configmap

kubectl create configmap scheduler-policy-config --from-file=./policy.cfg -n kube-system

The configmap must live in the kube-system namespace (that is where the scheduler looks for it by default), and its name must match the one referenced in the scheduler deployment file.

Let me explain a bit about the policy file. Before considering any of the priorities, the scheduler looks at the predicates and ensures that each one is fulfilled before deciding where to schedule the pod. You can think of this as a preliminary round of node filtering, where the scheduler checks whether a node satisfies certain constraints such as “PodFitsResources” and “PodToleratesNodeTaints”, to name a few.

Once these predicates are met, the scheduler decides which node to place the pod on based on the weights given to the priorities. In our case, I have set the “SelectorSpreadPriority” weight to 10 and left the rest at 1, so the scheduler favors spreading pods across nodes over the other priorities. Likewise, you can adjust the weights of the other priorities to fit your custom scheduling needs, such as increasing LeastRequestedPriority so that pods get scheduled on the nodes with the least requested resources.

STEP 5

If you’ve been following the steps correctly, you’ll notice that when you ran kubectl apply on the my-scheduler.yaml file, it also created a ClusterRoleBinding that binds our service account to the ClusterRole called system:kube-scheduler.

For our scheduler to read the configmap we have created, its service account needs the appropriate permissions, and for that we need to edit the ClusterRole. So let’s run this command:

kubectl edit clusterrole system:kube-scheduler -n kube-system

Add ‘configmaps’ as a resource under the core ("") API group and give it the verb ‘*’, that is, all permissions on configmaps.

The ClusterRole also lacks permissions for the storage.k8s.io API group, so you’ll have to add those to the role as well (see the sketch below).
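As a rough illustration (an assumption on my part, not taken from the original post), the extra rules you add while editing the ClusterRole might look something like this; if anything is still missing, permission errors in the scheduler logs will tell you which storage.k8s.io resources your version needs:

- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - '*'
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses   # adjust if the scheduler logs complain about other resources
  verbs:
  - get
  - list
  - watch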

STEP 6

Good news folks, we’re almost done!

Restart the my-scheduler pod in the kube-system namespace (deleting the pod is enough, since the Deployment will recreate it); it will now pick up the custom policy in the configmap we assigned to it. Make sure it is running fine; you can even check the pod logs with kubectl logs if you want.

And that’s pretty much it. All you have to do now is specify the scheduler in the pod spec, like so:

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: k8s.gcr.io/pause:2.0

I added the custom scheduler to my productpage deployment spec, and ta-dah!

Each replica is on a different node
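For reference, in a Deployment the field sits inside the pod template rather than at the top level. A minimal sketch, assuming a deployment named productpage (the names and image below are illustrative, not taken from the original post):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage
spec:
  replicas: 3
  selector:
    matchLabels:
      app: productpage
  template:
    metadata:
      labels:
        app: productpage
    spec:
      schedulerName: my-scheduler   # hand these pods to the custom scheduler
      containers:
      - name: productpage
        image: k8s.gcr.io/pause:2.0   # placeholder image; use your application image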

You can even run multiple schedulers side by side: different schedulers with different priorities for different pods, and a different k8s experience.

I hope this solution is able to resolve an issue you may have, or at the very least, you’ve learned something new today :)
