Skip to main content

Understand Kubernetes 3 : etcd

In the last article, we said there was a statetore in the master node; in practice, it is implemented using etcdetcd is open source distributed key-value store (from coreOs) using the raft consensus algorithm. You can find a good introduction of etcd herek8s use etcd to store all the cluster information and is the only stateful component in the whole k8s (we don't count in the stateful components of the application itself).
Notably, it stores the following information:
  • Resource object/spec submitted by the user
  • The scheduler results from master node
  • Current status of work nodes and Pods

etcd is the critical

The stability and responsiveness of etcd is critical to stability & performance of the whole cluster. here is an excellent blog from open AI sharing that, there etcd system, hindered by 1) the high disk latency due to cloud backend and 2) high network io load incurred by the monitoring system, was one of the biggest issues they encountered when scaling the nodes to 2500.
For a production system, we will set up a separate etcd cluster and connect the k8s master to it. The master will store the requests to the etcd, update the results by controllers/schedulers, and the work nodes will watch the relevant state change through master and take action according, e,g start a container on itself.
It looks like this diagram:

usage of etcd in k8s

etcd is set up separately, but it has to be setup first so that the nodes ip (and tls info) of in the etcd cluster can be pass to the apiserver running on the master nodes. Using that information (etcd-servers and etcd-tls) apiserver will create an etc client (or multiple clients) talking to the etcd. That is all the connection between etcd and k8s.

All the components in the api-server will use storage.Interface to communicate with storage. etcd is the only backend implementation at the moment and it supports two versions of etcd, v2 and v3, which is the default.
class storage.Interface {
    Create(key string, obj runtime.Object))
k8s master, to be specific, apiserver component, act as one client of the etcd, using the etcd client to implement the storage. Interface API with a little bit more stuff that fits k8s model.
Let's see two APIs, Create and Watch.
For create, the value part of the k/v is a runtime object, e.g Deployment spec, a few more steps (encoder, transform) is needed before finally commit that to the etcd.
  • Create
Create(key string, obj runtime.Object)
obj -> encoder -> transformer ->  clientv3.OpPut(key, v string)
Besides the normal create/get/delete, there is one operation that is very important for distributed k/v store, watch, which allows you block wait on something and being notified when something is changed. As a user case, someone can watch a specific location for new pod creation/deletion and then take the corresponding action.
Kublete doesn't watch the storage direction, instead, it watches it through API server.
  • Watch
func (wc *watchChan) startWatching(watchClosedCh chan struct{}) {
    wch := wc.watcher.client.Watch(wc.ctx, wc.key, opts...)

pluggable backend storage

In theory, you should be able to replace etcd with other k/v stores, such as Consul and Zookeeper.
There was a PR to add Consul as the backend, but was closed (after three years) as "not ready to do this in the near future". Why create pluggable container runtime but not for the storage backend, which seems make sense as well. One of the possible technical reason is that k8s and etcd are already loosely coupled so doesn't worth the effort to create another layer to make it pluggable.


etcd is the components storing all the state for k8s cluster. It is availability and performance is vital to the whole k8s. apisever is the only one that talks to ectd using etc clients, request that submit to apiserver will be encoded and transformed before committing to etcd. Anyone can watch a particular state change but not directly to the etcd instead that go through the apiserver.

Popular posts from this blog

Understand Container - Index Page

This is an index page to a series of 8 articles on container implementation. OCI Specification Linux Namespaces Linux Cgroup Linux Capability Mount and Jail User and Root Network and Hook Network and CNI
This page has a very good page view after being created. Then I was thinking if anyone would be interested in a more polished, extended, and easier to read version.
So I started a book called "understand container". Let me know if you will be interested in the work by subscribing here and I'll send the first draft version which will include all the 8 articles here. The free subscription will end at 31th, Oct, 2018.

* Remember to click "Share email with author (optional)", so that I can send the book to your email directly. 


Understand Container: OCI Specification

OCI is the industry collaborated effort to define an open containers specifications regarding container format and runtime - that is the official tone and is true. The history of how it comes to where it stands today from the initial disagreement is a very interesting story or case study regarding open source business model and competition.

But past is past, nowadays, OCI is non-argumentable THE container standard, IMO, as we'll see later in the article it is adopted by most of the mainstream container implementation, including docker, and container orchestration system, such as kubernetes, Plus, it is particularly helpful to anyone trying to understand how the containers works internally. Open source code are awesome but it is double awesome with high quality documentation!

Android Security: An Overview Of Application Sandbox

The Problem: Define a policy to control how various clients can access different resources. A solution: Each resource has an owner and belongs to a group.Each client has an owner but can belongs to multiple groups.Each resource has a mode stating the access permissions allowed for its owner, group members and others, respectively. In the context of operating system, or Linux specifically, the resources can be files, sockets, etc; the clients are actually processes; and we have three access permissions:read, write and execute.