This article continues from Part 1 of advanced Kubernetes scheduling. In Part 1, we discussed taints and tolerations. In this article, we will take a look at other scheduling mechanisms provided by Kubernetes that can help us direct workloads to a particular node or schedule pods together.
Node affinity is a way to set rules based on which the scheduler selects nodes for scheduling a workload. Node affinity can be thought of as the opposite of taints: a taint applied to a node repels a certain set of workloads, whereas node affinity, applied to a pod, attracts the pod to a certain set of nodes.
Node affinity is a generalization of nodeSelector. With nodeSelector, we specify exactly which labels a node must carry for the pod to go there; with node affinity, we specify rules to select the nodes on which the pod can be scheduled. These rules are defined by labelling the nodes and having the pod spec specify selectors to match those labels. There are two types of affinity rules: preferred rules and required rules. Each rule also declares what happens at scheduling time and later, if the labels on a node change. Based on the combination, distinct policies can be enabled for scheduling decisions.
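For comparison, a plain nodeSelector pins a pod to nodes with an exact label match. A minimal sketch (the disktype label here is purely illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # nodeSelector supports only exact key/value matches;
  # node affinity generalizes this with operators such as In, NotIn, and Exists.
  nodeSelector:
    disktype: ssd
```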
With a preferred rule, the scheduler tries to place the pod on a matching node, but will fall back to a non-matching node if no node in the cluster matches the specified labels. preferredDuringSchedulingIgnoredDuringExecution is a preferred-rule affinity.
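A preferred rule might look like the following sketch; the zone label and the weight value are illustrative assumptions, not taken from the guestbook example:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1        # 1-100; nodes matching higher-weight terms score better
      preference:
        matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone   # illustrative label
          operator: In
          values:
          - us-east-1a
```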
With a required rule, if there is no matching node, the pod won’t be scheduled at all. With requiredDuringSchedulingIgnoredDuringExecution affinity, a pod is scheduled only if the node labels specified in the pod spec match the labels on a node. However, once the pod is scheduled, the labels are ignored, meaning that even if the node labels change, the pod will continue to run on that node.
With requiredDuringSchedulingRequiredDuringExecution affinity (planned, but not yet implemented at the time of writing), a pod is scheduled only if the node labels specified in the pod spec match the labels on a node, and if the labels on the node change in the future, the pod is evicted. This effect is similar to a NoExecute taint, with one significant difference: when a NoExecute taint is applied to a node, every pod without a matching toleration is evicted, whereas removing or changing a label evicts only the pods whose affinity rules no longer match.
Node affinity makes sense when we need to schedule a certain set of pods on a certain set of nodes, but do not want those nodes to reject everything else (as a taint would).
This assumes that you have cloned the kubernetes-scheduling-examples repository. Let’s begin by listing the nodes.
<code>kubectl get nodes
</code>
You should be able to see the list of nodes available in the cluster:
<code>NAME                          STATUS    AGE       VERSION
node1.compute.infracloud.io   Ready     25m       v1.9.4
node2.compute.infracloud.io   Ready     25m       v1.9.4
node3.compute.infracloud.io   Ready     28m       v1.9.4
</code>
Node affinity works on label matching. Let’s label node1 and verify it:
<code>kubectl label nodes node1.compute.infracloud.io thisnode=TheChosenOne
</code>
<code>kubectl get nodes --show-labels | grep TheChosenOne
</code>
Now let’s try to deploy the entire guestbook on node1. In each of the deployment YAML files, a node affinity for node1 is added as:
<code>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "thisnode"
          operator: In
          values: ["TheChosenOne"]
</code>
Run guestbook_create.sh to deploy the guestbook.
In a couple of minutes, you should be able to see that all the pods are scheduled on node1.
<code>NAME                            READY     STATUS    RESTARTS   AGE       IP            NODE
frontend-85b968cdc5-c785v       1/1       Running   0          49s       10.20.29.13   node1.compute.infracloud.io
frontend-85b968cdc5-pw2kl       1/1       Running   0          49s       10.20.29.14   node1.compute.infracloud.io
frontend-85b968cdc5-xxh7h       1/1       Running   0          49s       10.20.29.15   node1.compute.infracloud.io
redis-master-7bbf6b76bf-ttb6b   1/1       Running   0          1m        10.20.29.10   node1.compute.infracloud.io
redis-slave-747f8bc7c5-2tjtw    1/1       Running   0          1m        10.20.29.11   node1.compute.infracloud.io
redis-slave-747f8bc7c5-clxzh    1/1       Running   0          1m        10.20.29.12   node1.compute.infracloud.io
</code>
The output will also yield a load balancer ingress URL, which can be used to access the guestbook. To finish off, run guestbook_cleanup.sh to remove the guestbook.
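The cleanup script removes the guestbook, but you may also want to drop the label we added earlier. A trailing dash after the label key removes that label:

```shell
# Remove the thisnode label from node1 (label key followed by a dash)
kubectl label nodes node1.compute.infracloud.io thisnode-
```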
In Kubernetes, node affinity allows you to schedule a pod on a set of nodes based on the labels present on those nodes. However, in certain scenarios we might want to schedule certain pods together, or make sure that certain pods are never scheduled together. This can be achieved with pod affinity and pod anti-affinity respectively. Similar to node affinity, pod affinity comes in two variants: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
Let’s deploy deployment-Affinity.yaml, which specifies the following pod affinity:
<code>
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
</code>
Here we are specifying that all nginx pods should be scheduled together; the topologyKey of kubernetes.io/hostname means that “together” is defined as “on the same node”. Let’s apply and verify:
<code>kubectl apply -f deployment-Affinity.yaml
</code>
<code>kubectl get pods -o wide -w
</code>
You should be able to see that all pods are scheduled on the same node.
<code>NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-6bc5bb7f45-49dtg   1/1       Running   0          36m       10.20.29.18   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-4ngvr   1/1       Running   0          36m       10.20.29.20   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-lppkn   1/1       Running   0          36m       10.20.29.19   node2.compute.infracloud.io
</code>
To clean up, run,
<code>kubectl delete -f deployment-Affinity.yaml
</code>
Let’s deploy deployment-AntiAffinity.yaml, which specifies the following pod anti-affinity:
<code>
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
</code>
Here we are specifying that no two nginx pods should be scheduled on the same node. Let’s apply and verify:
<code>kubectl apply -f deployment-AntiAffinity.yaml
</code>
<code>kubectl get pods -o wide -w
</code>
You should be able to see that pods are scheduled on different nodes.
<code>NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-85d87bccff-4w7tf   1/1       Running   0          27s       10.20.29.16   node3.compute.infracloud.io
nginx-deployment-85d87bccff-7fn47   1/1       Running   0          27s       10.20.42.32   node1.compute.infracloud.io
nginx-deployment-85d87bccff-sd4lp   1/1       Running   0          27s       10.20.13.17   node2.compute.infracloud.io
</code>
Note: In the above example, if the number of replicas is greater than the number of nodes, some of the pods will remain in the Pending state, since the required rule forbids co-locating them.
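If you want spreading without the risk of Pending pods, a softer option is the preferredDuringSchedulingIgnoredDuringExecution variant of anti-affinity, which lets the scheduler co-locate pods when it has no other choice. A sketch, reusing the same nginx label:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100        # strongest preference, but still only a preference
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
```

With this rule, extra replicas beyond the node count would still be scheduled, just on nodes that already run an nginx pod.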
To clean up, run,
<code>kubectl delete -f deployment-AntiAffinity.yaml
</code>
This covers the advanced scheduling mechanisms provided by Kubernetes. Have any questions? Feel free to drop a comment below.
To sum it up, Kubernetes provides simple mechanisms like taints, tolerations, node affinity and pod affinity to schedule workloads dynamically. The mechanisms are simple, but when used with labels and selectors they provide considerable leverage over how pods are scheduled.
Looking for help with Kubernetes adoption or Day 2 operations? Check out how we’re helping startups and enterprises with our managed services for Kubernetes.