
Understand kubernetes 4 : Scheduler

The best-known job of a container orchestrator is to "assign Pods to Nodes", so-called scheduling. If all Pods and Nodes were the same, it would be a trivial problem to solve - a round-robin policy would do the job. In practice, however, Pods have different resource requirements, and, less obviously, Nodes may have different capabilities - think of machines purchased 5 years ago versus brand-new ones.

An Analogy: Rent a house

Say you want to rent a house, and you tell the agent that any house with 2 bedrooms and 2 bathrooms is fine; however, you don't want a house with a swimming pool, since you would rather go to the beach and don't want to pay for something you won't use.
That actually covers the main concepts of the k8s scheduler's job.
  • You/Tenant: have some requirements (rooms)
  • Agent: the k8s scheduler
  • Houses (owned by landlords): the Nodes
You tell the agent your must-haves, definite no-nos, and nice-to-haves.
The agent's job is to find you a house matching your requirements and anti-requirements.
The owner can also reject an application based on his preferences (say, no pets).

Requirements for Pod scheduler

Let's see some practical requirements when placing a Pod on a Node.
1 Run Pods on a specific type of Node: e.g. run this Pod on Ubuntu 17.10 only.
2 Run Pods of different services on the same Node: e.g. place the webserver and memcache on the same Node.
3 Spread Pods of a service across different Nodes: e.g. place the webservers on Nodes in different zones for fault tolerance.
4 Best utilization of resources: e.g. run as "many" jobs as possible, but be able to preempt the low-priority ones.
In the k8s world,
1, 2 can be resolved using Affinity
3 can be resolved using Anti-Affinity
4 can be resolved using Taint and Toleration, and Priority and Preemption
Before talking about those scheduling policies, we first need a way to identify the Nodes. Without identification, the scheduler can do nothing better than allocate using only the capacity information of each Node.

Label the Nodes

Nothing fancy. Nodes are labeled.
You can add any label you want, but there are predefined common labels, including
  • hostname
  • os/arch/instance-type
  • zone/region
The first can be used to identify a single Node, the second a type of Node, and the last supports geolocation-related fault tolerance or scalability.
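As a concrete illustration, here is how those labels might appear on a Node. The keys shown are the current well-known label keys (older clusters used beta.kubernetes.io/* and failure-domain.* variants); the values are made-up examples:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    kubernetes.io/hostname: node-1                 # identifies a single node
    kubernetes.io/os: linux                        # type of node
    kubernetes.io/arch: amd64
    node.kubernetes.io/instance-type: m5.large
    topology.kubernetes.io/zone: us-east-1a        # geolocation
    topology.kubernetes.io/region: us-east-1
```

You can also attach your own labels, e.g. kubectl label nodes node-1 disktype=ssd (node name and key are just for illustration).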


Affinity

There are two types of Affinity: Node Affinity and Pod Affinity. The former indicates an affinity to a type of Node and can be used to achieve the 1st requirement; the latter indicates an affinity to Nodes on which a certain type of Pod is already running, and can be used to achieve the 2nd requirement.
The affinity can be soft or hard, meaning nice-to-have and must-have respectively.
Reverse the logic of Affinity and it becomes Anti-Affinity, meaning a Pod doesn't want to be on Nodes with a certain feature. Requirement 3 can then be implemented as "a Pod doesn't want to be on a Node that already runs a Pod of the same type (identified by Pod labels)".
Side note: you might know that in Linux a process can set its CPU affinity, that is, which CPU core it prefers to run on. That resembles the problem of placing a Pod on a specific (type of) Node, as does the cpuset controller in cgroups.
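To make this concrete, here is a sketch of requirements 1 and 3 in a single Pod spec: a hard Node Affinity plus a soft Pod Anti-Affinity. The field names follow the Kubernetes API; the app label and the specific values are made up for illustration (the Ubuntu 17.10 case would need a custom label, since the OS version is not a built-in one):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webserver
  labels:
    app: webserver
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard: "must"
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values: ["linux"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:  # soft: "nice-to-have"
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: webserver        # avoid nodes/zones already running this app
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: web
    image: nginx
```

The podAntiAffinity part is exactly the "spread the webservers" trick: the Pod prefers zones that don't already run a Pod labeled app=webserver.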

Taint and Toleration

The landlord tells the agent that he only wants to rent the house to a programmer (for whatever reason). So unless a renter identifies himself as a programmer, the agent won't submit his application to the landlord.
Similarly, a Node can declare a special requirement (called a Taint) and use it to repel a set of Pods. Unless a Pod can tolerate the taint, it won't be placed on that Node.
I found the concept of Taints and Tolerations a little twisted, since a taint sounds like a bad thing - an unreasonable requirement/restriction that a Pod has to tolerate. It is more like a landlord requiring half a year's rent upfront, so that only those who can tolerate it are able to apply.
One thing to remember is that a Taint is an attribute of the Node, and it gives the Node a voice for its preferences; Affinity, in contrast, is how a Pod expresses its preference for Nodes.
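Continuing the analogy: tainting a node with kubectl taint nodes node-1 dedicated=gpu:NoSchedule (the node name and the dedicated=gpu key/value are hypothetical) repels every Pod except those that declare a matching toleration in their spec, for example:

```yaml
# Pod spec fragment: tolerate the (hypothetical) dedicated=gpu taint
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

Note that a toleration only lets the Pod onto the tainted Node; it does not force the Pod there. To attract the Pod to that Node as well, combine the toleration with a Node Affinity.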

Priority and Preemption

Maximizing resource utilization is important, yet it is easily overlooked since most people don't have the experience of managing thousands of servers. As pointed out in section 5 of the Borg paper, which k8s is inspired by:

  One of Borg's primary goals is to make efficient use of Google's fleet of
  machines, which represents a significant financial investment: increasing
  utilization by a few percentage points can save millions of dollars.

How to increase utilization? That could mean many things, such as: scheduling jobs fast, optimizing Pod allocation so that more jobs can be accommodated, and, last but not least, being able to preempt low-priority jobs in favor of high-priority ones.
The last one just makes sense for a machine: doing something is always better than sitting idle, and when a more important job comes along, the low-priority one gets preempted.
An implication of being preemptible is that we have to spend a minute thinking about the effect on the Pod/service that may be evicted. Does it matter? How does it terminate gracefully?
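In k8s, priority is expressed through a PriorityClass that Pods reference by name. A minimal sketch (the class name and value are made up; the apiVersion is scheduling.k8s.io/v1 in current releases, v1beta1 in older ones):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000            # higher number = higher priority
globalDefault: false
description: "For latency-critical services"
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority   # may preempt lower-priority Pods
  containers:
  - name: app
    image: nginx
```

When no Node has room for important-app, the scheduler may evict lower-priority Pods to make space - which is exactly why those Pods should handle termination gracefully.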

Make it real

To make things more real, take a look at this sample toy scheduler, which binds a Pod to the cheapest Node that can "fit" the resource requirements needed by the Pod.
Here are a few takeaways:
  1. You can roll your own scheduler.
  2. You can have more than one scheduler in the system. Each scheduler looks after a particular set/type of Pods and schedules them. (It doesn't make sense to have multiple schedulers trying to schedule the same set of Pods - they would race.)
  3. A scheduler always talks to the API server as a client. It asks the API server for unscheduled Pods, schedules them using a defined policy, and posts the scheduling results (i.e. Pod/Node bindings) back to the API server.
  scheduler -> api server: get me unscheduled Pods
  scheduler -> api server: get me Node info/status/capacity
  scheduler: schedule according to a predefined policy
  scheduler -> api server: post binding result
  scheduler -> api server: post binding OK events
You can find default scheduler here.
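The core decision of that toy scheduler - "among the Nodes with enough free resources, bind to the cheapest one" - can be sketched in a few lines. This is a minimal illustration, not the linked scheduler itself; the node/pod dictionaries and field names are made up, and a real scheduler would get this data from the API server:

```python
def fits(pod, node):
    """A node fits if its free resources cover the pod's requests."""
    return (node["free_cpu"] >= pod["cpu"] and
            node["free_mem"] >= pod["mem"])

def pick_node(pod, nodes):
    """Return the cheapest node that fits the pod, or None."""
    candidates = [n for n in nodes if fits(pod, n)]
    if not candidates:
        return None  # pod stays Pending until resources free up
    return min(candidates, key=lambda n: n["cost_per_hour"])

nodes = [
    {"name": "big",   "free_cpu": 8, "free_mem": 32, "cost_per_hour": 0.40},
    {"name": "small", "free_cpu": 2, "free_mem": 4,  "cost_per_hour": 0.05},
]
pod = {"cpu": 1, "mem": 2}
print(pick_node(pod, nodes)["name"])  # → small
```

Both nodes fit the pod, so the cheaper one wins; a pod requesting more than any node offers would simply stay unscheduled.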


We went over the requirements for a Pod scheduler and the ways to achieve them in k8s.
