This article continues from Part 1 of advanced Kubernetes scheduling. In Part 1, we discussed taints and tolerations. In this article, we will take a look at other scheduling mechanisms provided by Kubernetes that can help us direct workloads to a particular node or schedule pods together.
Node affinity is a way to set rules based on which the scheduler selects nodes for scheduling a workload. Node affinity can be thought of as the opposite of taints: a taint applied to a node repels a certain set of workloads, whereas node affinity, applied to a pod, attracts the pod to a certain set of nodes.
Node affinity is a generalization of nodeSelector. With nodeSelector, we specify exactly which labels a node must carry for the pod to go there; with node affinity, we specify rules to select the nodes on which the pod can be scheduled. These rules are defined by labelling the nodes and having the pod spec specify selectors to match those labels. There are two types of affinity rules: preferred rules and required rules. Each rule also declares what happens at scheduling time and later, if the labels on a node change. Based on the combination, distinct policies can be enabled for scheduling decisions.
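For comparison, a plain nodeSelector pins a pod to nodes with an exact label match. A minimal sketch (the disktype label here is purely illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # nodeSelector supports only exact key/value matches;
  # node affinity generalizes this with operators such as In, NotIn, and Exists.
  nodeSelector:
    disktype: ssd
```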
With a preferred rule, the scheduler tries to place the pod on a matching node, but will fall back to a non-matching node if no node in the cluster matches the specified labels. preferredDuringSchedulingIgnoredDuringExecution is a preferred-rule affinity.
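A preferred rule might look like the following sketch; the zone label and the weight value are illustrative assumptions, not taken from the guestbook example:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1        # 1-100; nodes matching higher-weight terms score better
      preference:
        matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone   # illustrative label
          operator: In
          values:
          - us-east-1a
```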
With a required rule, if there is no matching node, the pod won’t be scheduled at all. With requiredDuringSchedulingIgnoredDuringExecution affinity, a pod is scheduled only if the node labels specified in the pod spec match the labels on a node. However, once the pod is scheduled, the labels are ignored, meaning that even if the node labels change, the pod will continue to run on that node.
With requiredDuringSchedulingRequiredDuringExecution affinity (planned, but not yet implemented at the time of writing), a pod is scheduled only if the node labels specified in the pod spec match the labels on a node, and if the labels on the node change in the future, the pod is evicted. This effect is similar to a NoExecute taint, with one significant difference: when a NoExecute taint is applied to a node, every pod without a matching toleration is evicted, whereas removing or changing a label evicts only the pods whose affinity rules no longer match.
Node affinity makes sense when we need to schedule a certain set of pods on a certain set of nodes, but do not want those nodes to reject everything else (as a taint would).
This assumes that you have cloned the kubernetes-scheduling-examples repository. Let’s begin by listing the nodes.
<code>kubectl get nodes
</code>
You should be able to see the list of nodes available in the cluster:
<code>NAME                          STATUS    AGE       VERSION
node1.compute.infracloud.io   Ready     25m       v1.9.4
node2.compute.infracloud.io   Ready     25m       v1.9.4
node3.compute.infracloud.io   Ready     28m       v1.9.4
</code>
Node affinity works on label matching. Let’s label node1 and verify it:
<code>kubectl label nodes node1.compute.infracloud.io thisnode=TheChosenOne
</code>
<code>kubectl get nodes --show-labels | grep TheChosenOne
</code>
Now let’s try to deploy the entire guestbook on node1. In each of the deployment YAML files, a node affinity for node1 is added as:
<code>
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "thisnode"
          operator: In
          values: ["TheChosenOne"]
</code>
Run guestbook_create.sh to deploy the guestbook.
In a couple of minutes, you should be able to see that all the pods are scheduled on node1.
<code>NAME                            READY     STATUS    RESTARTS   AGE       IP            NODE
frontend-85b968cdc5-c785v       1/1       Running   0          49s       10.20.29.13   node1.compute.infracloud.io
frontend-85b968cdc5-pw2kl       1/1       Running   0          49s       10.20.29.14   node1.compute.infracloud.io
frontend-85b968cdc5-xxh7h       1/1       Running   0          49s       10.20.29.15   node1.compute.infracloud.io
redis-master-7bbf6b76bf-ttb6b   1/1       Running   0          1m        10.20.29.10   node1.compute.infracloud.io
redis-slave-747f8bc7c5-2tjtw    1/1       Running   0          1m        10.20.29.11   node1.compute.infracloud.io
redis-slave-747f8bc7c5-clxzh    1/1       Running   0          1m        10.20.29.12   node1.compute.infracloud.io
</code>
The output will also yield a load balancer ingress URL, which can be used to access the guestbook. To finish off, run guestbook_cleanup.sh to remove the guestbook.
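The cleanup script removes the guestbook, but you may also want to drop the label we added earlier. A trailing dash after the label key removes that label:

```shell
# Remove the thisnode label from node1 (label key followed by a dash)
kubectl label nodes node1.compute.infracloud.io thisnode-
```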
In Kubernetes, node affinity allows you to schedule a pod on a set of nodes based on the labels present on those nodes. However, in certain scenarios we might want to schedule certain pods together, or make sure that certain pods are never scheduled together. This can be achieved with pod affinity and pod anti-affinity respectively. Similar to node affinity, pod affinity comes in two variants: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
Let’s deploy deployment-Affinity.yaml, which specifies the following pod affinity:
<code>
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
</code>
Here we are specifying that all nginx pods should be scheduled together; the topologyKey of kubernetes.io/hostname means that “together” is defined as “on the same node”. Let’s apply and verify:
<code>kubectl apply -f deployment-Affinity.yaml
</code>
<code>kubectl get pods -o wide -w
</code>
You should be able to see that all pods are scheduled on the same node.
<code>NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-6bc5bb7f45-49dtg   1/1       Running   0          36m       10.20.29.18   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-4ngvr   1/1       Running   0          36m       10.20.29.20   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-lppkn   1/1       Running   0          36m       10.20.29.19   node2.compute.infracloud.io
</code>
To clean up, run,
<code>kubectl delete -f deployment-Affinity.yaml
</code>
Let’s deploy deployment-AntiAffinity.yaml, which specifies the following pod anti-affinity:
<code>
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
</code>
Here we are specifying that no two nginx pods should be scheduled on the same node. Let’s apply and verify:
<code>kubectl apply -f deployment-AntiAffinity.yaml
</code>
<code>kubectl get pods -o wide -w
</code>
You should be able to see that pods are scheduled on different nodes.
<code>NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-85d87bccff-4w7tf   1/1       Running   0          27s       10.20.29.16   node3.compute.infracloud.io
nginx-deployment-85d87bccff-7fn47   1/1       Running   0          27s       10.20.42.32   node1.compute.infracloud.io
nginx-deployment-85d87bccff-sd4lp   1/1       Running   0          27s       10.20.13.17   node2.compute.infracloud.io
</code>
Note: In the above example, if the number of replicas is greater than the number of nodes, some of the pods will remain in the Pending state, since the required rule forbids co-locating them.
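If you want spreading without the risk of Pending pods, a softer option is the preferredDuringSchedulingIgnoredDuringExecution variant of anti-affinity, which lets the scheduler co-locate pods when it has no other choice. A sketch, reusing the same nginx label:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100        # strongest preference, but still only a preference
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
```

With this rule, extra replicas beyond the node count would still be scheduled, just on nodes that already run an nginx pod.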
To clean up, run,
<code>kubectl delete -f deployment-AntiAffinity.yaml
</code>
This covers the advanced scheduling mechanisms provided by Kubernetes. Have any questions? Feel free to drop a comment below.
To sum it up, Kubernetes provides simple mechanisms like taints, tolerations, node affinity and pod affinity to schedule workloads dynamically. The mechanisms are simple, but when used with labels and selectors they provide considerable leverage over how pods are scheduled.
Looking for help with Kubernetes adoption or Day 2 operations? Check out how we’re helping startups and enterprises with our managed services for Kubernetes.