
Understand Kubernetes 4: Scheduler

The best-known job of a container orchestrator is to "assign Pods to Nodes", also called scheduling. If all the Pods and Nodes were the same, this would be a trivial problem: a round-robin policy would do the job. In practice, however, Pods have different resource requirements and, less obviously, Nodes have different capabilities. Think of machines purchased 5 years ago sitting next to brand new ones.

An Analogy: Rent a house

Say you want to rent a house, and you tell the agent that any house with 2 bedrooms and 2 bathrooms is fine. However, you don't want a house with a swimming pool, since you would rather go to the beach and don't want to pay for something you won't use.
That actually covers the main concepts and jobs of the k8s scheduler.
  • You/Tenant: the Pod, which has some requirements (rooms)
  • Agent: the k8s scheduler
  • Houses (owned by Landlords): the Nodes
You tell the agent your must-have, definite no-no, and nice-to-have requirements.
The agent's job is to find you a house that matches your requirements and anti-requirements.
The owner can also reject an application based on his own preferences (say, no pets).

Requirements for a Pod scheduler

Let's look at some practical requirements when placing a Pod on a Node.
1 Run Pods on a specific type of Node: e.g. run this Pod on Ubuntu 17.10 only.
2 Run Pods of different services on the same Node: e.g. place the webserver and memcache on the same Node.
3 Spread Pods of a service across different Nodes: e.g. place the webserver on Nodes in different zones for fault tolerance.
4 Make the best use of resources: e.g. run as many jobs as possible, but be able to preempt the low-priority ones.
In the k8s world,
1, 2 can be resolved using Affinity
3 can be resolved using Anti-Affinity
4 can be resolved using Taint and Toleration and Priority and Preemption
Before we talk about those scheduling policies, we first need a way to identify the Nodes. Without that identification, the scheduler can do nothing better than allocate using only the capacity information of each Node.

Label the Nodes

Nothing fancy. Nodes are labeled.
You can add any label you want, but there are predefined common labels, including
  • hostname
  • os/arch/instance-type
  • zone/region
The first can be used to identify a single Node, the second a type of Node, and the last is for geolocation-related fault tolerance or scalability.
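To make the three kinds of label concrete, here is a minimal Python sketch of label-based selection. The label keys are the well-known Kubernetes ones. The node names, the label values and the `select_nodes` helper are made up for illustration and are not part of any Kubernetes API.

```python
# Hypothetical node records. Only the label keys are real Kubernetes
# well-known labels; names and values are made up for illustration.
NODES = [
    {"name": "node-a", "labels": {
        "kubernetes.io/hostname": "node-a",
        "kubernetes.io/os": "linux",
        "kubernetes.io/arch": "amd64",
        "node.kubernetes.io/instance-type": "big",
        "topology.kubernetes.io/zone": "zone-1",
        "topology.kubernetes.io/region": "region-1",
    }},
    {"name": "node-b", "labels": {
        "kubernetes.io/hostname": "node-b",
        "kubernetes.io/os": "linux",
        "kubernetes.io/arch": "arm64",
        "node.kubernetes.io/instance-type": "small",
        "topology.kubernetes.io/zone": "zone-2",
        "topology.kubernetes.io/region": "region-1",
    }},
    {"name": "node-c", "labels": {
        "kubernetes.io/hostname": "node-c",
        "kubernetes.io/os": "linux",
        "kubernetes.io/arch": "amd64",
        "node.kubernetes.io/instance-type": "small",
        "topology.kubernetes.io/zone": "zone-2",
        "topology.kubernetes.io/region": "region-1",
    }},
]


def select_nodes(nodes, selector):
    """Return the names of nodes whose labels contain every selector pair."""
    return [
        node["name"]
        for node in nodes
        if all(node["labels"].get(k) == v for k, v in selector.items())
    ]


# One node by hostname, a type of node by arch, a group of nodes by zone.
print(select_nodes(NODES, {"kubernetes.io/hostname": "node-a"}))
print(select_nodes(NODES, {"kubernetes.io/arch": "amd64"}))
print(select_nodes(NODES, {"topology.kubernetes.io/zone": "zone-2"}))
```

Everything the scheduler policies below do builds on this kind of equality match between a selector and a label set.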


Affinity

There are two types of Affinity: Node Affinity and Pod Affinity. The first indicates an affinity to a type of Node and can be used to achieve the 1st requirement; the latter indicates an affinity to Nodes that already have a certain type of Pod running and can be used to achieve the 2nd requirement.
The affinity can be soft or hard, meaning nice-to-have and must-have respectively.
Reverse the logic of Affinity and it becomes Anti-Affinity, which means a Pod doesn't want to be on Nodes with a certain feature. Requirement 3 can be implemented as "a Pod doesn't want to be on a Node that already runs the same Pod (identified by Pod label)".
Side note: you might know that in Linux a process can set its CPU affinity, that is, which CPU cores it may run on. That resembles the problem of placing a Pod on a specific (type of) Node, as does cpuset in cgroups.
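As a rough illustration of hard versus soft rules, the Python sketch below filters Nodes with the hard rules and ranks the survivors with the soft ones. All field names here (`required`, `near`, `avoid`, `preferred`) and the `place` function are hypothetical. In real manifests the hard and soft variants are spelled `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution`, and the real scheduler does much more than this.

```python
def matches(labels, selector):
    """True if labels contain every key/value pair in selector."""
    return all(labels.get(k) == v for k, v in selector.items())


def place(pod, nodes):
    """Return the name of the chosen node, or None if no node qualifies."""
    candidates = []
    for node in nodes:
        # Hard Node Affinity: the node must carry the required labels.
        if not matches(node["labels"], pod.get("required", {})):
            continue
        # Hard Pod Affinity: a pod with these labels must already run here.
        near = pod.get("near")
        if near and not any(matches(p, near) for p in node["pods"]):
            continue
        # Hard Anti-Affinity: no pod with these labels may already run here.
        avoid = pod.get("avoid")
        if avoid and any(matches(p, avoid) for p in node["pods"]):
            continue
        candidates.append(node)
    if not candidates:
        return None

    # Soft affinity: each satisfied preference adds its weight to the score.
    def score(node):
        return sum(weight for key, value, weight in pod.get("preferred", [])
                   if node["labels"].get(key) == value)

    return max(candidates, key=score)["name"]


# Hypothetical cluster: each node has labels and the labels of its pods.
NODES = [
    {"name": "n1", "labels": {"disk": "ssd", "zone": "a"}, "pods": [{"app": "web"}]},
    {"name": "n2", "labels": {"disk": "ssd", "zone": "b"}, "pods": []},
    {"name": "n3", "labels": {"disk": "hdd", "zone": "b"}, "pods": []},
]

# Requirement 3: a second web Pod must not share a node with the first one.
web = {"required": {"disk": "ssd"}, "avoid": {"app": "web"}}
# Requirement 2: memcache wants to sit next to a web Pod.
cache = {"near": {"app": "web"}}
# Soft preference only: zone "b" is nice to have, nothing is mandatory.
batch = {"preferred": [("zone", "b", 10)]}

print(place(web, NODES), place(cache, NODES), place(batch, NODES))
```

Note the design split: a hard rule can leave a Pod with no Node at all, while a soft rule only reorders the Nodes that already passed.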

Taint and Toleration

The landlord tells the agent that he only wants to rent the house to programmers (for whatever reason). So unless a renter identifies himself as a programmer, the agent won't submit his application to the landlord.
Similarly, a Node can add a special requirement (called a Taint) and use it to repel a set of Pods. Unless a Pod tolerates the Taint, it will not be placed on the Node.
I found the concept of Taints and Tolerations a little bit twisted, since a Taint sounds like something bad: an unreasonable requirement or restriction that the Pod has to tolerate. It is more like a landlord requiring half a year's rent upfront, so that only those who tolerate this are able to apply.
One thing to remember is that a Taint is an attribute of a Node, and it gives the Node an opportunity to voice its preference; Affinity, in contrast, is how a Pod shows its preference for Nodes.
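Here is a small Python sketch of the matching rule, written from my understanding of how Taints and Tolerations are compared. The `key`/`value`/`effect`/`operator` fields and the effect names mirror the real objects, but `tolerates` and `can_schedule` are hypothetical helpers, not Kubernetes code.

```python
def tolerates(toleration, taint):
    """True if this single toleration matches this single taint."""
    # A toleration without an effect matches taints of every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; with no key it tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal (the default) needs both key and value to match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod_tolerations, node_taints):
    """The node repels the pod if any hard taint is left untolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        # PreferNoSchedule is only a soft hint, so it never blocks placement.
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


# The landlord's rule: programmers only.
programmers_only = {"key": "tenant", "value": "programmer", "effect": "NoSchedule"}
programmer = [{"key": "tenant", "operator": "Equal",
               "value": "programmer", "effect": "NoSchedule"}]

print(can_schedule([], [programmers_only]))
print(can_schedule(programmer, [programmers_only]))
```

A toleration only removes the obstacle. It does not attract the Pod to the tainted Node, which is the job of Affinity.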

Priority and Preemption

Maximising resource utilization is important, and it is easily overlooked, since most people don't have experience managing thousands of servers. As pointed out in section 5 of the Borg paper, which inspired k8s:
"One of Borg’s primary goals is to make efficient use of Google’s fleet of machines, which represents a significant financial investment: increasing utilization by a few percentage points can save millions of dollars."
How do we increase utilization? That could mean many things: scheduling jobs fast, optimizing Pod allocation so that more jobs can be accommodated and, last but not least, being able to interrupt low-priority jobs in favor of high-priority ones.
The last one simply makes sense for a machine: doing something is always better than sitting idle. But when more important jobs arrive, the low-priority job will be preempted.
One implication of possibly being preempted is that we have to spend a minute thinking about the effect on the Pod/service that may be evicted. Does it matter? How can it terminate gracefully?
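The idea of preemption can be sketched in a few lines of Python. This is a deliberately simplified model under my own assumptions: a single resource (CPU), one Node, and a hypothetical `pick_victims` helper that evicts the lowest-priority Pods first. The real scheduler weighs far more than this.

```python
def pick_victims(node, incoming):
    """Return the pods to evict so that incoming fits, or None if it cannot."""
    free = node["capacity"] - sum(p["cpu"] for p in node["pods"])
    if free >= incoming["cpu"]:
        return []  # fits as is, nobody has to go
    # Only strictly lower-priority pods may be evicted; lowest goes first.
    lower = sorted(
        (p for p in node["pods"] if p["priority"] < incoming["priority"]),
        key=lambda p: p["priority"],
    )
    victims = []
    for pod in lower:
        victims.append(pod["name"])
        free += pod["cpu"]
        if free >= incoming["cpu"]:
            return victims
    return None  # even evicting every lower-priority pod is not enough


# Hypothetical node running a batch job and a web server, fully packed.
NODE = {
    "capacity": 4,
    "pods": [
        {"name": "batch", "cpu": 2, "priority": 10},
        {"name": "web", "cpu": 2, "priority": 100},
    ],
}

print(pick_victims(NODE, {"name": "api", "cpu": 2, "priority": 50}))
print(pick_victims(NODE, {"name": "cron", "cpu": 2, "priority": 5}))
```

In a real cluster the evicted Pod still gets its graceful termination period before it is killed, which is why the shutdown question above deserves an answer.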

Make it real

To make things more concrete, take a look at this sample toy scheduler, which binds a Pod to the cheapest Node, as long as that Node can "fit" the resource requirements of the Pod.
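The "cheapest Node that fits" policy is easy to sketch in Python. This is not the toy scheduler's actual code. The `cost`, `free_cpu` and `free_mem` fields and both function names are assumptions made up for this illustration.

```python
def fits(node, pod):
    """True if the node has enough free CPU and memory for the pod."""
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def cheapest_fit(nodes, pod):
    """Return the name of the cheapest node that fits the pod, else None."""
    candidates = [node for node in nodes if fits(node, pod)]
    if not candidates:
        return None
    return min(candidates, key=lambda node: node["cost"])["name"]


# Hypothetical nodes: cost per hour, free CPU cores, free memory in GiB.
NODES = [
    {"name": "tiny", "cost": 1, "free_cpu": 1, "free_mem": 2},
    {"name": "mid", "cost": 3, "free_cpu": 4, "free_mem": 8},
    {"name": "huge", "cost": 9, "free_cpu": 32, "free_mem": 128},
]

print(cheapest_fit(NODES, {"cpu": 2, "mem": 4}))
```

The shape is the same as in the default scheduler at a very high level: first filter out the Nodes that cannot run the Pod, then rank the rest by some score.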
Here are a few takeaways:
  1. You can roll your own scheduler.
  2. You can have more than one scheduler in the system. Each scheduler looks after a particular set/type of Pods and schedules them. (It doesn't make sense to have multiple schedulers trying to schedule the same set of Pods; they would race each other.)
  3. The scheduler always talks to the API server as a client. It asks the API server for unscheduled Pods, schedules them using a defined policy, and posts the scheduling results (i.e. Pod/Node bindings) to the API server.
[Sequence diagram between scheduler and API server: get me unscheduled Pods; get me Node info/status/capacity; schedule it according to a predefined policy; post binding result; post binding OK events.]
You can find the default scheduler here.
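The request flow from takeaway 3 can be sketched in Python as one pass of a scheduling loop. To keep it runnable without a cluster, the API server is replaced by a hypothetical in-memory `FakeApiServer`. None of its methods are real Kubernetes client calls, and the "most free CPU" policy is just a placeholder.

```python
class FakeApiServer:
    """Hypothetical in-memory stand-in for the API server."""

    def __init__(self, pods, nodes):
        self.pods = pods      # pod name -> {"cpu": int, "node": str or None}
        self.nodes = nodes    # node name -> free CPU
        self.events = []

    def unscheduled_pods(self):
        # A pod without a node assigned is still waiting for a scheduler.
        return [name for name, pod in self.pods.items() if pod["node"] is None]

    def node_capacity(self):
        return dict(self.nodes)

    def bind(self, pod_name, node_name):
        # Record the Pod/Node binding and emit a "binding OK" event.
        self.pods[pod_name]["node"] = node_name
        self.nodes[node_name] -= self.pods[pod_name]["cpu"]
        self.events.append(f"Scheduled {pod_name} to {node_name}")


def schedule_once(api):
    """One pass: fetch pending pods, pick a node for each, post the binding."""
    for pod_name in api.unscheduled_pods():
        capacity = api.node_capacity()
        need = api.pods[pod_name]["cpu"]
        fitting = [name for name, free in capacity.items() if free >= need]
        if not fitting:
            continue  # nothing fits, the pod stays pending
        # Placeholder policy: take the node with the most free CPU.
        api.bind(pod_name, max(fitting, key=lambda name: capacity[name]))


api = FakeApiServer(
    pods={
        "a": {"cpu": 2, "node": None},
        "b": {"cpu": 3, "node": None},
        "c": {"cpu": 8, "node": None},
    },
    nodes={"n1": 4, "n2": 3},
)
schedule_once(api)
print(api.events)
```

Because the scheduler is just a client, swapping in your own policy means changing only the body of `schedule_once`. The fetch and bind steps stay the same.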


We went over the requirements of a Pod scheduler and the ways to achieve them in k8s.
