All the code for this chapter (and other chapters) is available at https://github.com/param108/kubernetes101; check the directory 015.
In the last post we looked at nodeSelector. You can use it to force a pod onto a particular node in the cluster. Let's take some time to understand why we would be choosy about which node a pod resides on.
Choosing Nodes
Nodes in a kubernetes cluster need not be the same. It's only important that they run the same kubernetes version.
High-Availability
The most basic reason to choose nodes for pods is to allow for the case where a node goes down. A simple rule: as far as possible, replicas should not be on the same node.
Heterogeneous Nodes
Another example: suppose you run a team which develops a few simple web services with low load, plus some data-science workloads that hog memory and CPU. Instead of making all your nodes the same size, you can have a few bigger boxes and some more normal-sized boxes. Then you can send all the data-science pods to the bigger boxes and the web pods to the normal-sized boxes.
Migration
There are certain issues that may crop up on a kubernetes node, causing it to stop working properly. You then need to move pods away from this node gradually. One option is to mark the faulty node as unschedulable, scale up the pods so replacements land on healthy nodes, and finally kill the pods on the faulty node.
I hope these examples give you an idea of why control over pod scheduling is important.
NodeAffinity, PodAffinity, PodAntiAffinity
spec.affinity in the podspec has 3 optional members
- nodeAffinity
- podAffinity
- podAntiAffinity
nodeAffinity: This is used to either force the pod to be scheduled on certain nodes or force it to avoid certain nodes.
podAffinity: This is used to base a scheduling decision on other pods on the node. If a match is found, the pod will be scheduled on a node with pods that match the match-clause.
podAntiAffinity: This is also used to base a scheduling decision on other pods on the node. If a match is found, the pod will NOT be scheduled on a node with pods that match the match-clause.
Preferred or Required
Each of the above members can have either or both of 2 possible members
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
As the names suggest, required... is a forced match, while preferred... is a weighted suggestion.
preferredDuringSchedulingIgnoredDuringExecution
is an array of NodeSelectorTerms with weights. The weight is a number (1-100) denoting the relative importance of each NodeSelectorTerm.
requiredDuringSchedulingIgnoredDuringExecution
is an array of NodeSelectorTerms without weights. A node is considered a match if at least one of the terms matches it; within a term, all the expressions have to match.
# for nodeAffinity the below 2 options exist
preferredDuringSchedulingIgnoredDuringExecution =
[ {
NodeSelectorTerm,
Weight,
}... ]
requiredDuringSchedulingIgnoredDuringExecution =
[ NodeSelectorTerm... ]
# for [Anti]podAffinity the below 2 options exist
preferredDuringSchedulingIgnoredDuringExecution =
[ {
PodAffinityTerm,
Weight,
}... ]
requiredDuringSchedulingIgnoredDuringExecution =
[ PodAffinityTerm... ]
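In actual PodSpec YAML the wrapper field names are weight and preference (for nodeAffinity) and weight and podAffinityTerm (for podAffinity / podAntiAffinity). A minimal sketch might look like the block below; the disktype and app labels are placeholders, not something used later in this post.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: team
          operator: In
          values:
          - alice
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web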
NodeSelectorTerm
The NodeSelectorTerm can have 2 members
matchExpressions: matches on node labels
matchFields: matches on node fields (such as metadata.name)
Each of these has a format like below
{
key: The label or field to look at.
operator: One of [In, NotIn, Exists, DoesNotExist, Gt, Lt]
values: Array of possible values of the key
}
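For example, a single NodeSelectorTerm that combines a label match with a field match could look like the sketch below; to my knowledge metadata.name is the field typically used with matchFields. Both criteria inside the term must be satisfied.
- matchExpressions:
  - key: team
    operator: In
    values:
    - alice
  matchFields:
  - key: metadata.name
    operator: NotIn
    values:
    - realkind-worker3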
PodAffinityTerm
The PodAffinityTerm has 3 members
- labelSelector: An array of match criteria, similar to matchExpressions above, applied to pod labels.
- namespaces: The list of namespaces that the search for pods covers. If empty, it defaults to the namespace of the pod being scheduled.
- topologyKey: This defines the concept of colocation. Nodes with the same value of the label mentioned in topologyKey are considered colocated. For example, if we use podAffinity to keep pods together, they may still end up on different nodes as long as those nodes share the same value of the topologyKey label.
A LabelSelector has two members: matchLabels (exact key-value matches) and matchExpressions, which has the following form
matchExpressions:
[{
key: The label to look at.
operator: One of [In, NotIn, Exists, DoesNotExist]
values: Array of possible values of the key
}...]
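Putting the three members together, a PodAffinityTerm that looks for pods labelled type=db, treating each individual node as its own topology domain, would be a sketch like the one below. namespaces is omitted here, so the search stays in the pod's own namespace.
topologyKey: kubernetes.io/hostname
labelSelector:
  matchExpressions:
  - key: type
    operator: In
    values:
    - db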
This allows quite sophisticated match criteria. From a maintainability perspective it's better to keep these as minimal as possible and to clean up unused match criteria.
War!
Alice and Bob are a lot happier since their RBAC debacle. As their boss, though, you are seeing a rising number of issues in the cluster because of misbehaving pods.
Misbehaving pods are pods that hog CPU and resources.
Whenever there is a failure in Alice’s components she is able to, skillfully, point to at least one of Bob’s pods that is misbehaving. Similarly, failures in Bob’s components are being attributed to Alice’s misbehaving pods.
What do you do?
Peace
You decide to allocate nodes to each team, labelling them team: [alice|bob]. Alice and Bob need to set the affinity of their pods accordingly.
The cluster has 4 non-master nodes, so each team gets 2 nodes.
Alice’s problem
Alice has a web service which runs three replicas. With the above constraint she only has 2 nodes to use for herself.
How will she maintain an equal distribution of her pods?
Eureka
Alice has a Eureka moment!
Using affinity she can tell the pods to look for nodes with the label team=alice and to avoid nodes which already have pods of the same type. Let's see what that looks like. Using the previous web application, now available in 015/service, we can test it on kind.
create registry
In order to push our images to kind clusters we need to create a registry and add its details to the kind configuration. There is a shell script at 015/service/registry.sh which will create the registry for you and print its IP address. Its contents are below; the registry listens on port 5000.
# run a local docker registry container on port 5000
docker run -d --restart=always -p "5000:5000" --name "kind-registry" registry:2
# print the ip address of the registry
docker inspect kind-registry -f '{{.NetworkSettings.IPAddress}}'
The output of the script for me was
172.17.0.7
create cluster
Let's create the cluster with one control-plane node and 4 worker nodes. We need to modify the plain vanilla config as below to register the docker registry to use. The IP address we got previously is used in the endpoint; please change it to the IP you get when you run the command. This file is available at 015/service/kind.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
- role: worker
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
    endpoint = ["http://172.17.0.7:5000"]
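The node names in the output below suggest the cluster was created with the name realkind; assuming that, the create command would be along the lines of:
kind create cluster --name realkind --config 015/service/kind.yaml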
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
realkind-control-plane Ready master 4m55s v1.17.0
realkind-worker Ready <none> 4m11s v1.17.0
realkind-worker2 Ready <none> 4m11s v1.17.0
realkind-worker3 Ready <none> 4m11s v1.17.0
realkind-worker4 Ready <none> 4m12s v1.17.0
setup the nodes
Let's label the nodes.
$ kubectl label nodes realkind-worker team=alice
node/realkind-worker labeled
$ kubectl label nodes realkind-worker2 team=alice
node/realkind-worker2 labeled
$ kubectl label nodes realkind-worker3 team=bob
node/realkind-worker3 labeled
$ kubectl label nodes realkind-worker4 team=bob
node/realkind-worker4 labeled
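To confirm the assignment, kubectl can print the team label as an extra column:
kubectl get nodes -L team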
configure the pods
Postgres Pod
Install the service for the database (015/service/components_service.yml).
apiVersion: v1
kind: Service
metadata:
  name: components
spec:
  clusterIP: None
  selector:
    type: db
  ports:
  - name: postgres
    protocol: TCP
    port: 5432
    targetPort: 5432
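Note that clusterIP: None makes this a headless service; together with the hostname and subdomain set on the database pod below, it is what lets the web pods reach the database at database.components. It is installed with a plain apply:
kubectl apply -f 015/service/components_service.yml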
Now install the postgres pod and make sure it's on one of Alice's nodes (015/service/postgres.yml).
apiVersion: v1
kind: Pod
metadata:
  name: database
  labels:
    app: web
    type: db
spec:
  hostname: database
  subdomain: components
  containers:
  - name: db
    image: postgres:9.6
    ports:
    - containerPort: 5432
    env:
    - name: POSTGRES_DB
      value: web
    - name: POSTGRES_USER
      value: web
    - name: POSTGRES_PASSWORD
      value: web
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: team
            operator: In
            values:
            - alice
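The pod is applied in the usual way:
kubectl apply -f 015/service/postgres.yml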
See the nodeAffinity restricting the pod to nodes with team=alice. Also note it is a required condition. Once applied, we can see which node the pod is scheduled on:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
database 1/1 Running 0 7m25s 10.244.4.2 realkind-worker2
If you scroll up you will see that realkind-worker2 is one of Alice's nodes.
App Pod
First we need to build the web docker image and push it to the new registry we just created. To do this, go to the 015/service directory and run:
# in 015/service directory
make docker
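The Makefile itself isn't reproduced in this post; given the registry setup above, the make docker target presumably does something equivalent to the (hypothetical) commands below.
# build the web image and push it to the local registry (hypothetical equivalent of `make docker`)
docker build -t localhost:5000/web:latest .
docker push localhost:5000/web:latest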
Now that the image is available in the registry, we can apply the PodSpec.
Now for the replicaset. You can find the complete code at 015/service/webreplicas.yml. The interesting part is in the pod template of the replicaset.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: backend
  labels:
    app: web
    type: replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      type: app
  template:
    metadata:
      name: web
      labels:
        app: web
        type: app
    spec:
      containers:
      - name: web
        image: localhost:5000/web:latest
        imagePullPolicy: IfNotPresent
        command: ["/web"]
        args: []
        ports:
        - containerPort: 8080
        env:
        - name: db_name
          value: web
        - name: db_user
          value: web
        - name: db_pass
          value: web
        - name: db_host
          value: database.components
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: team
                operator: In
                values:
                - alice
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: "kubernetes.io/hostname"
              labelSelector:
                matchExpressions:
                - key: type
                  operator: In
                  values:
                  - app
At the top we have the same nodeAffinity as our postgres pod, which directs the pod to Alice's nodes. Below that we have a podAntiAffinity section. It basically says: schedule this pod away from any node that already has pods with type=app. The topologyKey says that nodes with the same hostname are considered colocated. As this is an anti-affinity rule, matching pods should end up on nodes with different hostnames, i.e. different nodes.
Apart from the affinity bits, notice the image details:
image: localhost:5000/web:latest
imagePullPolicy: IfNotPresent
The image has localhost:5000/ as a prefix, specifying the registry to pull from. The imagePullPolicy is IfNotPresent. Kind behaves differently from minikube in this respect: there is no “internal” docker registry, so you need to provide an external registry.
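The replicaset is applied like the other manifests:
kubectl apply -f 015/service/webreplicas.yml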
Checking the results with kubectl get pods -o wide, we see
NAME READY STATUS RESTARTS AGE IP NODE
backend-29csn 1/1 Running 0 14m 10.244.2.2 realkind-worker2
backend-mmpkj 1/1 Running 0 14m 10.244.3.2 realkind-worker
backend-wg6s2 1/1 Running 0 14m 10.244.2.3 realkind-worker2
database 1/1 Running 0 12m 10.244.3.4 realkind-worker
As you can see, all the pods are on Alice’s nodes, and kubernetes has tried to distribute them equally amongst the 2 nodes.
Change podAntiAffinity in the pod template to podAffinity and see what happens.
You should see that all the pods colocate on the same node.
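For reference, the changed stanza in the pod template would look like this; everything else stays the same, only the podAntiAffinity key is renamed.
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      topologyKey: "kubernetes.io/hostname"
      labelSelector:
        matchExpressions:
        - key: type
          operator: In
          values:
          - app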
If kubernetes cannot schedule a pod so that all the “required” criteria are satisfied, the pod will stay in the Pending state.
“preferred” constraints are met on a best-effort basis.
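If a pod does get stuck in Pending, the scheduler's reasoning shows up in the pod's events:
# the Events section lists FailedScheduling messages explaining which constraints could not be met
kubectl describe pod <pod-name>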
Learnings
Using nodeAffinity and podAffinity we can program intelligent rules to help kubernetes schedule pods to the correct nodes.
There are many reasons why you want to do this.
All nodes in a kubernetes cluster need not have the same resources.
Kind uses an external docker registry rather than an internal one, unlike minikube.
It pays to keep config (such as affinity rules) simple, so that the next person doesn't have to pull their hair out to understand what's going on.
Conclusion
Kubernetes provides a number of primitives to control which nodes will run which pods. A system designer can use these to utilize node resources optimally and provide the best performance to the end user. We still have one more way to direct pods away from specific nodes: taints. We will take a look at those in the next post, along with the most rudimentary configuration, podSpec.nodeName.