
Understand Kubernetes 1: Container Orchestration

By now, we know the benefits of containers and how a container is implemented using Linux primitives.
If we only need to run one or two containers, that's all we need. But if we want to run dozens or thousands of containers to build a stable and scalable web service that can serve millions of transactions per second, we have more problems to solve. To name a few:
  • scheduling: Which host should a container be placed on?
  • update: How to update the container image and ensure zero downtime?
  • self-healing: How to detect and restart a container when it is down?
  • scaling: How to add more containers when more processing capacity is needed?
None of these issues is new; only the subject has changed to containers, rather than physical servers (in the old days) or virtual machines (more recently). The functions described above are usually referred to as container orchestration.


Kubernetes, abbreviated as k8s, is one of many container orchestration solutions. But, as of mid-2018, many would agree the competition is over; k8s is the de facto standard. I think this is good news, as it frees you from the hassle of picking from many options and worrying about investing in the wrong one. k8s is completely open source, with contributors ranging from big companies to individuals.
k8s has very good documentation, mostly here and here.
In this article, we'll take a different perspective. Instead of starting with how to use the tools, we'll start with the very object the k8s platform is trying to manage: the container. We'll look at what extra things k8s can do compared with a single-machine container runtime such as runc or docker, and how k8s integrates with those container runtimes.
However, we can't do that without an understanding of the high-level architecture of k8s.

At the highest level, k8s is a master and slave architecture, with a master node controlling multiple slave, or worker, nodes. Master and worker nodes together are called a k8s cluster. The user talks to the cluster through the API, which is served by the master. We intentionally left the master node in the diagram empty, to focus on how things are connected on the worker node.
The master talks to worker nodes through the kubelet, which primarily runs and stops Pods through CRI, which is connected to a container runtime. The kubelet also monitors Pods for liveness and pulls debug information and logs.
We'll go over the components in a little more detail below.


There are two types of nodes: master nodes and slave (worker) nodes. A node can be either a physical machine or a virtual machine.
You can jam the whole k8s cluster into a single machine, such as using minikube.


Each worker node has a kubelet, the agent that enables the master node to talk to the worker nodes.
The responsibilities of the kubelet include:
  • Creating/running Pods
  • Probing Pods
  • Monitoring Nodes/Pods
  • etc.
We can go nowhere without first introducing the Pod.


In k8s, the smallest scheduling or deployment unit is the Pod, not the container. But there shouldn't be any cognitive overhead if you already know containers well. The benefit of a Pod is that it adds another wrapper on top of the container to guarantee that closely coupled containers end up scheduled on the same host, so that they can share a volume or network that would otherwise be difficult or inefficient to implement if they were on different hosts.
A pod is a group of one or more containers, with shared storage and network, and a specification for how to run the containers. A pod’s contents are always co-located and co-scheduled and run in a shared context, such as namespaces and cgroups.
You can find the details here.

Configuring, Scheduling and Running a Pod

You configure a Pod using a YAML file, called the spec. As you can imagine, the Pod spec includes the configuration for each container, which covers the image and the runtime configuration.
With this spec, k8s will pull the image and run the container, just as you would with a simple docker command. Nothing particularly innovative here.
What is new is that the spec also describes the resource requirements of the containers/Pod, and k8s uses that information, along with the current cluster status, to find a suitable host for the Pod. This is called Pod scheduling. The functionality and effectiveness of the scheduler may be overlooked, but the Borg paper mentions that a better scheduler could save millions of dollars at Google's scale.
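To make the idea of scheduling concrete, here is a minimal sketch in Python. It is not the real k8s scheduler; it only illustrates the general shape of filtering out hosts that lack resources and then picking one of the rest. The names `Node`, `PodRequest` and `schedule`, and the "most free CPU left" scoring rule, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int  # unallocated CPU, in millicores
    free_mem_mib: int     # unallocated memory, in MiB


@dataclass
class PodRequest:
    name: str
    cpu_millis: int
    mem_mib: int


def schedule(pod: PodRequest, nodes: List[Node]) -> Optional[str]:
    """Pick a node for the pod, or return None if nothing fits."""
    # Filter: keep only nodes with enough free CPU and memory.
    feasible = [
        n for n in nodes
        if n.free_cpu_millis >= pod.cpu_millis and n.free_mem_mib >= pod.mem_mib
    ]
    if not feasible:
        return None
    # Score (toy rule): prefer the node with the most CPU left after placement.
    best = max(feasible, key=lambda n: n.free_cpu_millis - pod.cpu_millis)
    # Bind: reserve the requested resources on the chosen node.
    best.free_cpu_millis -= pod.cpu_millis
    best.free_mem_mib -= pod.mem_mib
    return best.name


nodes = [Node("node-a", 500, 1024), Node("node-b", 2000, 4096)]
print(schedule(PodRequest("web", 1000, 512), nodes))
print(schedule(PodRequest("huge", 4000, 512), nodes))
```

The first request only fits on `node-b`; the second fits nowhere, so the sketch returns `None`, which roughly corresponds to a Pod that stays unscheduled until capacity appears.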
In the spec, we can also specify the Liveness and Readiness Probes.
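As a rough illustration of what such a spec contains, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON, which k8s accepts alongside YAML. The Pod name, container name, image tag and probe settings are arbitrary example values, not taken from any real deployment.

```python
import json

# A single-container Pod: image, resource requests, and both probes.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-demo"},  # example name
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "nginx:1.25",  # example image
                "ports": [{"containerPort": 80}],
                # Used by the scheduler to find a host with enough room.
                "resources": {"requests": {"cpu": "250m", "memory": "128Mi"}},
                # Failing this probe leads to a container restart.
                "livenessProbe": {
                    "httpGet": {"path": "/", "port": 80},
                    "periodSeconds": 10,
                },
                # Failing this probe takes the Pod out of traffic rotation.
                "readinessProbe": {
                    "httpGet": {"path": "/", "port": 80},
                    "initialDelaySeconds": 5,
                },
            }
        ]
    },
}

manifest = json.dumps(pod_spec, indent=2)
print(manifest)
```

Saving the output to a file such as `pod.json` and passing it to `kubectl apply -f` should create the Pod, assuming a working cluster.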

Probe Pods

The kubelet uses liveness probes to know when to restart a container, and readiness probes to know when a container is ready to start accepting traffic. The first is the foundation for self-healing and the second for load balancing.
Without k8s, you would have to do all of this yourself. Time and $$ saved.
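The difference between the two probes can be sketched in a few lines of Python. This is a simplified model, not the kubelet's actual code: `ContainerState` and `probe_once` are hypothetical names, and the probes are plain callables standing in for HTTP or exec checks. The `failure_threshold` parameter mirrors the `failureThreshold` probe setting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContainerState:
    name: str
    restarts: int = 0
    ready: bool = False
    liveness_failures: int = 0


def probe_once(state: ContainerState,
               liveness: Callable[[], bool],
               readiness: Callable[[], bool],
               failure_threshold: int = 3) -> bool:
    """Run one probe cycle; return True if the container was restarted."""
    # Liveness: count consecutive failures, reset on success.
    if liveness():
        state.liveness_failures = 0
    else:
        state.liveness_failures += 1
    if state.liveness_failures >= failure_threshold:
        # Self-healing: restart the container and start over.
        state.restarts += 1
        state.liveness_failures = 0
        state.ready = False
        return True
    # Readiness: only decides whether traffic is sent to the container.
    state.ready = readiness()
    return False


state = ContainerState("web")
probe_once(state, lambda: True, lambda: True)
print(state.ready)
outcomes = [probe_once(state, lambda: False, lambda: True) for _ in range(3)]
print(outcomes, state.restarts)
```

Note that a failing readiness probe never restarts anything in this model; it only flips `ready`, which is what a load balancer would consult.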

Container Runtime: CRI

k8s isn't bound to a particular container runtime. Instead, it defines an interface, the Container Runtime Interface (CRI), for image management and the container runtime. Anyone who implements the interface can plug into k8s or, to be more accurate, into the kubelet.
There are multiple implementations of CRI. Docker has cri-containerd, which plugs containerd/docker into the kubelet. cri-o is another implementation, which wraps runc for the container runtime service and wraps a bunch of other libraries for the image service. Both use CNI for the network setup.
Assuming a Pod/container is assigned to a particular node, the kubelet on that node will operate as follows:
Sequence diagram participants: kubelet, CRI client, CRI server, image service, runtime service (runc).
Messages, in order:
  • run container
  • create (over gRPC)
  • pull image from a registry
  • unpack the image and create rootfs
  • create runtime config (config.json) using the pod spec
  • run container
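The sequence above can be mimicked with a small in-process Python sketch. Everything here is a stand-in: the classes `ImageService`, `RuntimeService` and `CriServer`, their method names, the rootfs path and the container id are all made up for illustration, and plain method calls replace the gRPC hop between the CRI client and server.

```python
from typing import Dict, List


class ImageService:
    """Stand-in for the image service behind the CRI server."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def pull_and_unpack(self, image: str) -> str:
        self.log.append(f"pull image {image} from a registry")
        self.log.append("unpack the image and create rootfs")
        # Made-up rootfs location, for illustration only.
        return "/tmp/demo-rootfs/" + image.replace(":", "_")


class RuntimeService:
    """Stand-in for the runtime service that would wrap runc."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def create_config(self, container: Dict, rootfs: str) -> Dict:
        self.log.append("create runtime config (config.json) using the pod spec")
        # A tiny subset of what a runtime config.json carries.
        return {"root": {"path": rootfs},
                "process": {"args": container.get("command", [])}}

    def run(self, name: str, config: Dict) -> str:
        self.log.append(f"run container {name}")
        return name + "-0"  # made-up container id


class CriServer:
    def __init__(self, log: List[str]) -> None:
        self.images = ImageService(log)
        self.runtime = RuntimeService(log)

    def create(self, container: Dict) -> str:
        rootfs = self.images.pull_and_unpack(container["image"])
        config = self.runtime.create_config(container, rootfs)
        return self.runtime.run(container["name"], config)


def kubelet_run_container(server: CriServer, container: Dict) -> str:
    """The kubelet side; in a real cluster this call goes over gRPC."""
    return server.create(container)


log: List[str] = []
container_id = kubelet_run_container(
    CriServer(log), {"name": "web", "image": "nginx:1.25", "command": ["nginx"]})
for step in log:
    print(step)
```

The point of the split is that the kubelet only sees the `create` call; which image library or low-level runtime sits behind it is up to the CRI implementation.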


We went through why we need a container orchestration system, and then the high-level architecture of k8s, with a focus on the components on the worker node and their integration with the container runtime.
