Understand Kubernetes 3 : etcd

In the last article, we said there was a state store in the master node; in practice, it is implemented using etcd. etcd is an open source distributed key-value store (from CoreOS) that uses the Raft consensus algorithm. You can find a good introduction to etcd here. k8s uses etcd to store all the cluster information, and it is the only stateful component in the whole of k8s (not counting the stateful components of the applications themselves).
Notably, it stores the following information:
  • Resource object/spec submitted by the user
  • The scheduling results from the master node
  • Current status of worker nodes and Pods

etcd is critical

The stability and responsiveness of etcd are critical to the stability and performance of the whole cluster. Here is an excellent blog post from OpenAI sharing that their etcd system, hindered by 1) high disk latency due to the cloud backend and 2) high network I/O load incurred by the monitoring system, was one of the biggest issues they encountered when scaling to 2,500 nodes.
For a production system, we set up a separate etcd cluster and connect the k8s master to it. The master stores the requests in etcd, the controllers and scheduler update the results, and the worker nodes watch the relevant state changes through the master and act accordingly, e.g. by starting a container locally.
It looks like this diagram:

usage of etcd in k8s

etcd is set up separately, but it has to be set up first so that the node IPs (and TLS info) of the etcd cluster can be passed to the apiserver running on the master nodes. Using that information (etcd-servers and etcd-tls), the apiserver creates an etcd client (or multiple clients) to talk to etcd. That is the entire connection between etcd and k8s.

All the components in the apiserver use storage.Interface to communicate with storage. etcd is the only backend implementation at the moment, and two versions of etcd are supported: v2 and v3, with v3 being the default.
type Interface interface { // storage.Interface, simplified
    Create(key string, obj runtime.Object)
    // ... other methods elided
}
The k8s master, or to be specific its apiserver component, acts as a client of etcd, using the etcd client to implement the storage.Interface API with a little extra that fits the k8s model.
Let's look at two APIs, Create and Watch.
For Create, the value part of the k/v pair is a runtime object, e.g. a Deployment spec, so a few more steps (encoder, transformer) are needed before it is finally committed to etcd.
  • Create
Create(key string, obj runtime.Object)
obj -> encoder -> transformer ->  clientv3.OpPut(key, v string)
Besides the normal create/get/delete, there is one operation that is very important for a distributed k/v store: watch, which lets you block on a key and be notified when something changes. As a use case, someone can watch a specific location for pod creation/deletion and then take the corresponding action.
The kubelet doesn't watch the storage directly; instead, it watches it through the API server.
  • Watch
func (wc *watchChan) startWatching(watchClosedCh chan struct{}) {
    wch := wc.watcher.client.Watch(wc.ctx, wc.key, opts...)
    // ... rest of the function elided
}
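
To make the watch idea concrete, here is a small self-contained sketch of a key-value store that notifies subscribers about changes under a key prefix. It does not use the etcd client: watchStore, Event, Put, Delete and the key layout are all hypothetical, and dropping events for slow subscribers is a simplification that a real store would not make.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Event is a hypothetical change notification.
type Event struct {
	Type  string // "PUT" or "DELETE"
	Key   string
	Value string
}

// watchStore is a hypothetical in-memory k/v store with prefix watches.
type watchStore struct {
	mu       sync.Mutex
	data     map[string]string
	watchers map[string][]chan Event // prefix -> subscriber channels
}

func newWatchStore() *watchStore {
	return &watchStore{
		data:     map[string]string{},
		watchers: map[string][]chan Event{},
	}
}

// Watch returns a buffered channel receiving events for keys under prefix.
func (s *watchStore) Watch(prefix string) <-chan Event {
	ch := make(chan Event, 16)
	s.mu.Lock()
	s.watchers[prefix] = append(s.watchers[prefix], ch)
	s.mu.Unlock()
	return ch
}

// notify must be called with s.mu held; sends never block.
func (s *watchStore) notify(ev Event) {
	for prefix, chans := range s.watchers {
		if !strings.HasPrefix(ev.Key, prefix) {
			continue
		}
		for _, ch := range chans {
			select {
			case ch <- ev:
			default: // subscriber buffer full: drop (simplification)
			}
		}
	}
}

func (s *watchStore) Put(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
	s.notify(Event{Type: "PUT", Key: key, Value: value})
}

func (s *watchStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, key)
	s.notify(Event{Type: "DELETE", Key: key})
}

func main() {
	s := newWatchStore()
	pods := s.Watch("/pods/")
	s.Put("/pods/web", "nginx")
	s.Put("/nodes/n1", "ready") // outside the watched prefix
	s.Delete("/pods/web")
	for i := 0; i < 2; i++ {
		ev := <-pods
		fmt.Println(ev.Type, ev.Key)
	}
}
```

The watcher only sees the two events under /pods/ and never has to poll, which is the same shape of interaction a node has with the API server when it waits for pods assigned to it.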

pluggable backend storage

In theory, you should be able to replace etcd with other k/v stores, such as Consul and Zookeeper.
There was a PR to add Consul as a backend, but it was closed (after three years) as "not ready to do this in the near future". Why make the container runtime pluggable but not the storage backend, which seems to make sense as well? One possible technical reason is that k8s and etcd are already loosely coupled, so it is not worth the effort to create another layer just to make the backend pluggable.


etcd is the component that stores all the state of a k8s cluster. Its availability and performance are vital to the whole of k8s. The apiserver is the only component that talks to etcd, using etcd clients; requests submitted to the apiserver are encoded and transformed before being committed to etcd. Anyone can watch a particular state change, not by talking to etcd directly but by going through the apiserver.