All the code for this chapter (and other chapters) is available at https://github.com/param108/kubernetes101. Check the directory 015.

In the last post we looked at nodeSelector. You can use it to force a pod onto a particular node in the cluster. Let’s take some time to understand why we would be choosy about which node a pod resides on.

Choosing Nodes

Nodes in a kubernetes cluster need not be the same. It’s only important that they run the same kubernetes version.

High-Availability

The most basic reason to choose nodes for pods is to handle the case where a node goes down. A simple rule: as far as possible, replicas should not be on the same node.

Heterogeneous Nodes

Another example: suppose you run a team which develops a few simple web services with low load, plus some data-science jobs that hog memory and CPU. Instead of having all your nodes the same size, you can have a few bigger boxes and a few more normal-sized boxes. Then you can send all the data-science pods to the bigger boxes and the web pods to the normal-sized boxes.

Migration

Certain issues may crop up on a kubernetes node, causing it to stop working properly. You then need to move pods away from this node gradually. One option is to mark the faulty node as unschedulable, scale up the pods so the healthy nodes pick up the load, and finally kill the pods on the faulty node.
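
For instance, a rough sequence could look like this (the node and deployment names here are hypothetical, just for illustration):

# stop new pods from being scheduled on the faulty node
kubectl cordon faulty-node
# scale up so the healthy nodes pick up extra replicas
kubectl scale deployment web --replicas=5
# finally evict whatever is still running on the faulty node
kubectl drain faulty-node --ignore-daemonsets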

I hope these examples give you an idea of why control over pod scheduling is important.

NodeAffinity, PodAffinity, Pod-Anti-Affinity

spec.affinity in the podspec has 3 optional members

  1. nodeAffinity
  2. podAffinity
  3. podAntiAffinity

nodeAffinity: This is used either to force the pod onto certain nodes or to force it to avoid certain nodes.
podAffinity: This is used to base the scheduling decision on the pods already running on a node. If a match is found, the pod will be scheduled on a node with pods that match the match-clause.
podAntiAffinity: This is also used to base the scheduling decision on the pods already running on a node. If a match is found, the pod will NOT be scheduled on a node with pods that match the match-clause.
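
As a rough sketch of where these sit in a pod manifest (the bodies are elided with {} here; concrete examples follow later in the post):

spec:
  affinity:
    nodeAffinity: {}        # rules based on node labels and fields
    podAffinity: {}         # attract towards nodes already running matching pods
    podAntiAffinity: {}     # repel from nodes already running matching pods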

Preference or Required

Each of the above members can have either or both of the following 2 members

  1. requiredDuringSchedulingIgnoredDuringExecution
  2. preferredDuringSchedulingIgnoredDuringExecution

As the names suggest, required... is a hard requirement, while preferred... is a weighted suggestion. In both cases the rule is only evaluated at scheduling time; a pod that is already running is not evicted if the rule stops matching later (that is what IgnoredDuringExecution means).

preferredDuringSchedulingIgnoredDuringExecution is an array of NodeSelectorTerms with weights. The weights are numbers (1-100) that denote the importance of each term.

requiredDuringSchedulingIgnoredDuringExecution is an array of NodeSelectorTerms without weights. A node is considered a match if any one of the terms matches it; within a term, all the expressions must match.

# for nodeAffinity the below 2 options exist
preferredDuringSchedulingIgnoredDuringExecution =
  [ { NodeSelectorTerm, Weight }... ]

requiredDuringSchedulingIgnoredDuringExecution =
  [ NodeSelectorTerm... ]

# for podAffinity and podAntiAffinity the below 2 options exist
preferredDuringSchedulingIgnoredDuringExecution =
  [ { PodAffinityTerm, Weight }... ]

requiredDuringSchedulingIgnoredDuringExecution =
  [ PodAffinityTerm... ]

NodeSelectorTerm

The NodeSelectorTerm can have 2 members

matchExpressions: matches on node labels
matchFields: matches on node fields (like metadata.name)

Each of these is an array of entries in the format below

{
key: The label or field to look at.
operator:  One of [In, NotIn, Exists, DoesNotExist, Gt, Lt]
values: Array of possible values of the key
}
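
For example, a NodeSelectorTerm that matches nodes labelled disktype=ssd but excludes the node named realkind-worker3 might look like this (the disktype label is hypothetical):

matchExpressions:
  - key: disktype
    operator: In
    values:
      - ssd
matchFields:
  - key: metadata.name
    operator: NotIn
    values:
      - realkind-worker3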

PodAffinityTerm

The PodAffinityTerm has 3 members

  1. labelSelector: Has an array of match criteria similar to matchExpressions above, except that it matches on pod labels.
  2. namespaces: The list of namespaces that the search for pods covers. Can be empty, which means the namespace of the pod being scheduled.
  3. topologyKey: This defines the concept of colocation. Nodes with the same value for the label named in topologyKey are considered colocated. For example, if we ask for pods to be together using podAffinity, they may still land on different nodes; as long as those nodes share the same value of the topologyKey label, they count as colocated.
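
Putting the three members together, a PodAffinityTerm might look something like this (the app=web label and the namespace are just for illustration):

labelSelector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - web
namespaces:
  - default
topologyKey: kubernetes.io/hostname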

The LabelSelector has the following structure

matchExpressions:
  [{
     key: The pod label to look at.
     operator:  One of [In, NotIn, Exists, DoesNotExist]
     values: Array of possible values of the key
   }...]

This allows quite sophisticated match criteria. From a maintainability perspective it’s better to keep these as minimal as possible and clean up unused match criteria.

War!

Alice and Bob are a lot happier since their RBAC debacle. As their boss, though, you are seeing a rising number of issues in the cluster because of misbehaving pods.

Misbehaving pods are pods that hog CPU and other resources.

Whenever there is a failure in Alice’s components, she is able to skillfully point to at least one of Bob’s pods that’s misbehaving. Similarly, failures in Bob’s components are attributed to Alice’s misbehaving pods.

What do you do?

Peace

You decide to allocate nodes to each team by labelling them team=alice or team=bob. Alice and Bob need to set the affinity of their pods accordingly.

The cluster has 4 non-master nodes, so each team gets 2 nodes.

Alice’s problem

Alice has a web service which runs three replicas. With the above constraint she only has 2 nodes to use for herself.

How will she maintain equal distribution of her pods?

Eureka

Alice has a Eureka moment!

Using affinity she can tell the pods to look for nodes with the label team=alice and to avoid nodes which already have pods of the same type. Let’s see what that looks like. Using the previous web application, now available in 015/service, we can test it on kind.

create registry

In order to push our images to kind clusters we need to create a registry and add its details to the kind configuration. There is a shell script at 015/service/registry.sh which will create the registry for you and print its IP address. Its contents are below. The port used by the registry is 5000.

# run a local docker registry container, listening on port 5000
docker run -d --restart=always -p "5000:5000" --name "kind-registry" registry:2
# print the ip address of the registry
docker inspect kind-registry -f '{{.NetworkSettings.IPAddress}}'

The output of the script for me was

172.17.0.7

create cluster

Let’s create the cluster with one control-plane node and 4 worker nodes. We need to modify the plain vanilla config as below to register the docker registry to use. The IP address we got previously is used in the endpoint. Please change it to the IP you get when you run the command. This file is available at 015/service/kind.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
- role: worker
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
    endpoint = ["http://172.17.0.7:5000"]
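
With the config saved, create the cluster. The exact command below is an assumption; the cluster name realkind is inferred from the node names in the output that follows:

kind create cluster --name realkind --config 015/service/kind.yaml

Once the cluster is up we can check the nodes:
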
$ kubectl get nodes
NAME                     STATUS   ROLES    AGE     VERSION
realkind-control-plane   Ready    master   4m55s   v1.17.0
realkind-worker          Ready    <none>   4m11s   v1.17.0
realkind-worker2         Ready    <none>   4m11s   v1.17.0
realkind-worker3         Ready    <none>   4m11s   v1.17.0
realkind-worker4         Ready    <none>   4m12s   v1.17.0

setup the nodes

Let’s label the nodes.

$ kubectl label nodes realkind-worker team=alice
node/realkind-worker labeled

$ kubectl label nodes realkind-worker2 team=alice
node/realkind-worker2 labeled

$ kubectl label nodes realkind-worker3 team=bob
node/realkind-worker3 labeled

$ kubectl label nodes realkind-worker4 team=bob
node/realkind-worker4 labeled
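
You can verify the labels using the -L (label columns) flag of kubectl get:

$ kubectl get nodes -L team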

configure the pods

Postgres Pod

Install the service for the database (015/service/components_service.yml)

apiVersion: v1
kind: Service
metadata:
  name: components
spec:
  clusterIP: None
  selector:
    type: db
  ports:
    - name: postgres
      protocol: TCP
      port: 5432
      targetPort: 5432
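
Apply it (the path follows the repo layout mentioned at the top):

$ kubectl apply -f 015/service/components_service.yml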

Now install the postgres pod and make sure it’s on one of Alice’s nodes (015/service/postgres.yml)

apiVersion: v1
kind: Pod
metadata:
  name: database
  labels:
    app: web
    type: db
spec:
  hostname: database
  subdomain: components
  containers:
  - name: db
    image: postgres:9.6
    ports:
      - containerPort: 5432
    env:
      - name: POSTGRES_DB
        value: web
      - name: POSTGRES_USER
        value: web
      - name: POSTGRES_PASSWORD
        value: web
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: team
                operator: In
                values:
                  - alice
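
Apply the pod:

$ kubectl apply -f 015/service/postgres.yml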

See the nodeAffinity set to nodes with team=alice. Also note that it is a required condition. Once we apply it, we can see which node the pod is scheduled on:

$ kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP           NODE               
database   1/1     Running   0          7m25s   10.244.4.2   realkind-worker2

If you scroll up you will see that realkind-worker2 is an alice node.

App Pod

First we need to build the web docker image and push it to the new registry we just created. To do this, go to the 015/service directory and run

# in 015/service directory
make docker
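
I haven’t reproduced the Makefile here, but the make docker target presumably does something along these lines (a sketch, assuming the image is tagged for the local registry we started earlier):

# build the web image and push it to the local registry on port 5000
docker build -t localhost:5000/web:latest .
docker push localhost:5000/web:latest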

Now that the image has been pushed to the registry that the cluster can pull from, we can move on to the replicaset. You can find the complete code at 015/service/webreplicas.yml. The interesting part is in the pod template of the replicaset.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: backend
  labels:
    app: web
    type: replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      type: app
  template:
    metadata:
      name: web
      labels:
        app: web
        type: app
    spec:
      containers:
      - name: web
        image: localhost:5000/web:latest
        imagePullPolicy: IfNotPresent
        command: ["/web"]
        args: []
        ports:
          - containerPort: 8080
        env:
          - name: db_name
            value: web
          - name: db_user
            value: web
          - name: db_pass
            value: web
          - name: db_host
            value: database.components
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: team
                  operator: In
                  values:
                    - alice
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: "kubernetes.io/hostname"
                labelSelector:
                  matchExpressions:
                    - key: type
                      operator: In
                      values:
                        - app

At the top we have the same nodeAffinity as our postgres pod, which directs the pod to Alice’s nodes. Below that we have a podAntiAffinity section. It basically says: schedule this pod away from any pods which have type=app. The topologyKey says that nodes with the same hostname are considered colocated. As this is an anti rule, matching pods should end up on nodes with different hostnames, i.e. different nodes.

Apart from the affinity bits, notice the image details

        image: localhost:5000/web:latest
        imagePullPolicy: IfNotPresent

The image has localhost:5000/ as a prefix, specifying the registry. The imagePullPolicy is IfNotPresent. Kind behaves differently from minikube in this respect: there is no “internal” docker registry, so you need to provide an external registry.
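
Apply the replicaset:

$ kubectl apply -f 015/service/webreplicas.yml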

Checking the results we see

NAME            READY   STATUS    RESTARTS   AGE   IP           NODE               
backend-29csn   1/1     Running   0          14m   10.244.2.2   realkind-worker2   
backend-mmpkj   1/1     Running   0          14m   10.244.3.2   realkind-worker  
backend-wg6s2   1/1     Running   0          14m   10.244.2.3   realkind-worker2
database        1/1     Running   0          12m   10.244.3.4   realkind-worker

As you can see, all the pods are on Alice’s nodes. Kubernetes tried to distribute them equally amongst the 2 nodes.

Change podAntiAffinity in the pod template to podAffinity and see what happens.

You should see that all the pods colocate on the same node.
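
For reference, the podAffinity variant of that section looks like this (a sketch; the structure is identical, only the member name changes):

        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: "kubernetes.io/hostname"
                labelSelector:
                  matchExpressions:
                    - key: type
                      operator: In
                      values:
                        - app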

If kubernetes cannot schedule a pod such that all “required” criteria are met, the pod will stay in the Pending state.

“preferred” constraints are met on a best-effort basis.

Learnings

Using nodeAffinity, podAffinity and podAntiAffinity we can program intelligent rules to help kubernetes schedule pods onto the correct nodes.

There are many reasons why you want to do this.

All nodes in a kubernetes cluster need not have the same resources.

Kind uses an external docker registry rather than an internal one, unlike minikube.

It pays to keep config (such as affinity rules) simple so that the next person doesn’t have to pull their hair out to understand what’s going on.

Conclusion

Kubernetes provides a number of primitives to control which nodes will run which pods. A system designer can use these to utilize node resources optimally and provide the best performance to the end user. We still have one more way to direct pods away from specific nodes: taints. We will take a look at those in the next post, along with the most rudimentary configuration, podSpec.nodeName.