How Does Scheduling Work in Kubernetes?

Scheduling is a core concept in Kubernetes and in orchestration in general. Ensuring that pods are assigned to nodes that can handle their resource demands, while maintaining overall cluster stability, is a crucial task. In this article we will look at the steps that take place to assign a pod to a node, as well as the techniques available to modify the scheduling behaviour. We will cover concepts such as node affinity, taints and tolerations, and the different scheduling policies.

Basics of Scheduling in Kubernetes

Scheduling in Kubernetes is handled by the kube-scheduler, a control plane component that usually runs as a pod. Its main task is to find the best node for each new Pod. This happens in two steps:

  1. Filtering: First, the kube-scheduler narrows down the list of available nodes to a smaller group of possible candidates. These nodes must meet all the scheduling requirements specified for the Pod. For example, the PodFitsResources filter ensures that a candidate Node has enough resources to handle the Pod's demands. We will cover the different mechanisms for filtering nodes later in this article.

  2. Scoring: Once the possible candidates have been identified, the kube-scheduler uses scoring functions to evaluate each node's suitability. These functions consider factors like node health, proximity to other Pods, and available resources. The node with the highest overall score is chosen as the final placement for the Pod.
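As a minimal sketch of what the filtering step looks at, the Pod below declares CPU and memory requests; only nodes with at least that much unreserved capacity would pass the resource filter. The Pod name, image, and values are illustrative assumptions, not taken from a real workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
    resources:
      requests:
        cpu: "500m"          # compared with the CPU still unreserved on each node
        memory: "256Mi"      # compared with the memory still unreserved on each node
```

Nodes that cannot satisfy both requests are dropped during filtering, and the remaining candidates move on to scoring.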

The Kubernetes scheduler is quite flexible. Its scheduling policy can be altered through different mechanisms that we will explore later in this article. More importantly, we can build and use a custom scheduler in any language, such as Go, Python, or Rust. The scheduler itself typically runs as a pod in the control plane of K8s.
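A Pod opts into a custom scheduler through the schedulerName field of its spec. The sketch below assumes a scheduler has already been deployed under the hypothetical name my-custom-scheduler; when the field is omitted, the default scheduler handles the Pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod            # hypothetical name
spec:
  schedulerName: my-custom-scheduler    # hypothetical scheduler name
  containers:
  - name: app
    image: nginx                        # placeholder image
```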

It is also possible to select a specific type of node within the pod definition (YAML configuration). For example, to select only nodes labeled as having an SSD disk, we can use the nodeSelector field in the Pod spec as follows:

  nodeSelector:
    disktype: ssd

It’s worth noting that a pod can also name a specific node in its definition through the nodeName field. This is uncommon and not recommended, because the scheduler already does a reasonable job of assigning pods to the right nodes. If a node is named this way, the scheduler is bypassed and the kubelet on that node tries to run the pod directly.

Phase 1: Filtering and Selecting Nodes

Node Affinity

Affinity is a mechanism in Kubernetes that gives more flexibility when assigning pods to nodes. It is more expressive and more powerful than the nodeSelector field.

We can specify multiple rules with logical operators: "In", "NotIn", "Exists", "DoesNotExist" as well as "Gt" and "Lt" for greater than and less than, respectively. Besides these logical operators, we can specify two types of affinity: a preferred (soft) affinity and a required (hard) affinity.

With preferred constraints, or soft affinities, the scheduler tries to find a node that meets them but still schedules the pod if no matching node is found. In contrast, hard affinities are mandatory constraints: the scheduler won't assign a pod to a node that doesn't match the specified requirements, and the pod stays pending until a matching node is available.

To better understand the affinity concept, consider a scenario where you have a Kubernetes cluster with nodes of varying amounts of RAM, and you want to ensure that a specific set of pods with higher RAM requirements is scheduled on nodes with ample memory. Your nodes are labeled with the key memory-type, which takes the values high, medium and low. If we want a specific pod to run only on nodes with high or medium memory, we can add the following affinity to the pod spec:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: memory-type
            operator: In
            values:
            - high
            - medium

The field requiredDuringSchedulingIgnoredDuringExecution means that this affinity is a hard one and must be enforced during the scheduling phase. The IgnoredDuringExecution part means that a pod already running on a node keeps running there even if the node's labels later change. To make this affinity a soft one, we should use the preferredDuringSchedulingIgnoredDuringExecution field instead.
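The other operators follow the same structure. The fragment below is a sketch that assumes two hypothetical node labels, memory-gb and spot: Gt and Lt take a single integer written as a string, while Exists and DoesNotExist take no values. Expressions listed under the same matchExpressions entry must all match.

```yaml
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: memory-gb          # hypothetical label holding an integer
            operator: Gt
            values:
            - "64"                  # node must have memory-gb greater than 64
          - key: spot               # hypothetical label; node must not carry it
            operator: DoesNotExist
```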

Anti-Affinity

Anti-affinity is like making sure your team members don't sit next to each other in a meeting. It's a way to spread your applications or services across different nodes in the cluster, so that if one node has a problem, the others can still do the job. The podAntiAffinity field can be used with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution to tell the scheduler whether the rule is a hard requirement or a soft one.

A simple yet illustrative example would be a chat app with two parts: the frontend that users interact with and the backend that stores messages.

You want to make sure that if one node fails, the frontend and the backend don't go down together. You can set up anti-affinity rules to prevent both parts from running on the same node. This way, if a node has an issue, only one of the two parts is affected while the other keeps running on a different node.
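The chat app scenario could be expressed roughly as follows. This is a sketch that assumes the backend pods carry the hypothetical label app: chat-backend; the names and image are placeholders. The topologyKey kubernetes.io/hostname makes "same place" mean "same node".

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: chat-frontend               # hypothetical name
  labels:
    app: chat-frontend              # hypothetical label
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - chat-backend          # avoid nodes already running a backend pod
        topologyKey: kubernetes.io/hostname
  containers:
  - name: frontend
    image: nginx                    # placeholder image
```

Swapping requiredDuringSchedulingIgnoredDuringExecution for the preferred variant would turn this into a soft rule, which lets the pod still be scheduled when no other node is available.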

Taints & Tolerations

Taints and tolerations are another useful pair of concepts in Kubernetes scheduling. A taint is a node property that repels pods, while a toleration is a pod property that allows the pod to be scheduled onto a node with a matching taint.

To add a taint to a node, we can use the kubectl taint nodes command, providing a key-value pair and an effect that together identify the taint:

kubectl taint nodes node1 key1=value1:NoSchedule

This adds a taint on node1 with the key key1, the value value1, and the NoSchedule effect. For a pod to be allowed on this node, we need to add a matching toleration to its specification:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

An example use case for taints and tolerations is a cluster in which some nodes are equipped with a GPU. It is preferable not to schedule pods that don't need a GPU on those nodes, and to keep them free for upcoming workloads that do. In this case, we can taint the GPU nodes and add the matching toleration only to the pods that require a GPU.
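A sketch of the GPU use case, assuming the GPU nodes carry a hypothetical taint gpu=true:NoSchedule. Only pods that include the toleration below are allowed onto those nodes. Note that a toleration permits scheduling on the tainted nodes but does not force it, so GPU workloads would usually combine it with a node affinity or nodeSelector.

```yaml
# Assumed taint on each GPU node (hypothetical key and value):
#   kubectl taint nodes <gpu-node-name> gpu=true:NoSchedule
tolerations:
- key: "gpu"              # must match the taint key
  operator: "Equal"
  value: "true"           # must match the taint value
  effect: "NoSchedule"    # must match the taint effect
```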

Taints and tolerations give granular control over where pods are scheduled. You can read more about them in the documentation.

Phase 2: Scoring Candidate Nodes

After the first phase has selected suitable nodes based on the different criteria (node affinity/anti-affinity, taints and tolerations, or a simple nodeSelector) comes the scoring phase. The scheduler assigns a score to each candidate node to identify the single best node to use. In a large cluster, evaluating every node for every pod becomes expensive, so the scheduler provides the percentageOfNodesToScore parameter: once it has found enough feasible nodes to reach that percentage of the cluster, it stops searching and scores only those. If no value is set, Kubernetes derives one with a linear formula that yields 50% for a 100-node cluster and 10% for a 5000-node cluster, with a lower bound of 5%.
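The threshold is set in the scheduler's configuration file, which is passed to kube-scheduler with the --config flag. The value of 30 below is only an illustration, and the apiVersion may differ depending on the cluster version.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Stop searching for feasible nodes once 30% of the cluster's nodes qualify.
percentageOfNodesToScore: 30
```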

When defining preferred (soft) affinities, we can assign a weight between 1 and 100 to each rule. For every rule that a node satisfies, the scheduler adds the rule's weight to that node's score.
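As a sketch, the fragment below reuses the memory-type label from the node affinity example and expresses a strong preference for high-memory nodes and a weaker one for medium-memory nodes. The weights 80 and 20 are arbitrary illustrative values.

```yaml
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                # added to the score of nodes labeled memory-type=high
        preference:
          matchExpressions:
          - key: memory-type
            operator: In
            values:
            - high
      - weight: 20                # added to the score of nodes labeled memory-type=medium
        preference:
          matchExpressions:
          - key: memory-type
            operator: In
            values:
            - medium
```

A node labeled memory-type=low matches neither rule, so it gains nothing from these preferences but remains eligible.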

Scheduling Profiles

We can customize the way the K8s scheduler scores nodes by configuring scheduling profiles and writing plugins for the scheduling framework. The scheduling framework diagram in the Kubernetes documentation shows which extension points can be implemented.

If we want to implement a custom scoring method, we can implement the Score() extension point. PreScore() is a hook that is triggered just before the scoring phase; it is useful for preparing the state that the Score() function will use.
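Writing a Score plugin requires code against the scheduling framework, but switching plugins on and off at these extension points only needs configuration. The sketch below defines a second profile, under the hypothetical name no-scoring-scheduler, that disables every preScore and score plugin; the apiVersion may differ depending on the cluster version.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler       # default behaviour, unchanged
- schedulerName: no-scoring-scheduler    # hypothetical profile name
  plugins:
    preScore:
      disabled:
      - name: '*'                        # disable all PreScore plugins
    score:
      disabled:
      - name: '*'                        # disable all Score plugins
```

A pod selects a profile by setting spec.schedulerName to the profile's schedulerName.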

Conclusion

Kubernetes has a powerful scheduling mechanism and provides different techniques to alter the scheduling behaviour without having to implement a full scheduler. Affinity/anti-affinity and taints and tolerations can go a long way. However, if we still need a custom scheduling strategy that cannot be achieved with the standard approaches, we can extend the scheduler through plugins.