
Understand Container 3: Linux Capabilities

Linux capabilities break up the super privileges enjoyed by the root user into fine-grained rights (well, just to avoid saying capabilities again), so that even as root you cannot do whatever you want unless you have been granted the corresponding capabilities.

prepare a rootfs

We'll need an additional tool (libcap) to explore capabilities, so here are some instructions for preparing a rootfs that includes it.
First, create a docker container with libcap installed:
sudo docker run -it alpine sh -c 'apk add -U libcap; capsh --print'
Use docker ps -a to find the container ID of the one we just ran; it should be the latest one.
Then export the rootfs to create a runc runtime bundle:
mkdir rootfs
docker export $container_id | tar -C rootfs -xvf -
runc spec


Using the default config.json generated by runc spec, you are not allowed to set the hostname, even as root.
$ sudo runc run xyxy67
/ # id
uid=0(root) gid=0(root)
/ # hostname cool
hostname: sethostname: Operation not permitted
That's because setting the hostname requires the CAP_SYS_ADMIN capability, even for root. We can grant it by adding CAP_SYS_ADMIN to the bounding, permitted, and effective lists of the capabilities attribute of the init process.
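As a minimal sketch of that edit, the Python snippet below adds a capability to the three lists under process.capabilities in a config.json-shaped dictionary. The helper name add_capability and the trimmed-down sample config are illustrative assumptions; a real config.json produced by runc spec contains many more fields and, depending on the runc version, may also list inheritable and ambient sets.

```python
import json

# Capability sets the article edits; other sets (if present) are left untouched.
TARGET_SETS = ("bounding", "permitted", "effective")


def add_capability(config, cap):
    """Return a copy of an OCI-style config dict with `cap` added to TARGET_SETS."""
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON-compatible data
    caps = updated.setdefault("process", {}).setdefault("capabilities", {})
    for set_name in TARGET_SETS:
        names = caps.setdefault(set_name, [])
        if cap not in names:
            names.append(cap)
    return updated


# Trimmed-down stand-in for the file produced by `runc spec` (assumption).
default_caps = ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
sample_config = {
    "process": {
        "capabilities": {name: list(default_caps) for name in TARGET_SETS}
    }
}

patched = add_capability(sample_config, "CAP_SYS_ADMIN")
print(json.dumps(patched["process"]["capabilities"], indent=2))
```

To apply this to a real bundle, one would load config.json with json.load, pass the result through add_capability, and write it back with json.dump before starting the container.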

Run another container with the new configuration, and now you are allowed to set the hostname.
$ sudo runc run xyxy67
/ # hostname
runc
/ # hostname hello
/ # hostname
hello
/ #
Run another command in the same container, and it will be able to set the hostname as well, since it inherits the capabilities of the init process.
$ sudo runc exec -t xyxy67 /bin/sh
[sudo] password for binchen:
/ # hostname
hello
/ # hostname good
/ # hostname
good

get the capability

Get the PIDs of the two processes in the runtime's PID namespace.
$ sudo runc ps xyxy67
UID   PID   PPID  C STIME TTY   TIME     CMD
root  26002 25993 0 11:42 pts/0 00:00:00 /bin/sh
root  26059 26051 0 11:43 pts/1 00:00:00 /bin/sh
Install pscap on the host:
sudo apt-get install libcap-ng-utils
Check the capabilities of the running processes, using their PIDs in the host namespace.
$ pscap | grep "26059\|26002"
25993 26002 root sh kill, net_bind_service, sys_admin, audit_write
26051 26059 root sh kill, net_bind_service, sys_admin, audit_write
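The kernel also exposes a process's effective set as the hexadecimal CapEff mask in /proc/<pid>/status. Below is a minimal sketch of decoding that mask, assuming the bit numbering from linux/capability.h (only the first 38 names are listed) and two hypothetical helpers, decode_mask and effective_caps. The sample status text is made up to match the capabilities shown above.

```python
import re

# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw", "ipc_lock",
    "ipc_owner", "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
    "sys_pacct", "sys_admin", "sys_boot", "sys_nice", "sys_resource",
    "sys_time", "sys_tty_config", "mknod", "lease", "audit_write",
    "audit_control", "setfcap", "mac_override", "mac_admin", "syslog",
    "wake_alarm", "block_suspend", "audit_read",
]


def decode_mask(mask):
    """Turn a capability bitmask into a list of names, lowest bit first."""
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            # Bits beyond the table are reported by number rather than guessed.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
    return names


def effective_caps(status_text):
    """Extract and decode the CapEff line from the text of /proc/<pid>/status."""
    match = re.search(r"^CapEff:\s*([0-9a-fA-F]+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no CapEff line found")
    return decode_mask(int(match.group(1), 16))


# Made-up status text: bits 5, 10, 21 and 29 are set in the mask.
sample_status = "Name:\tsh\nCapEff:\t0000000020200420\n"
print(", ".join(effective_caps(sample_status)))
# prints: kill, net_bind_service, sys_admin, audit_write
```

On a Linux host, passing the contents of /proc/26002/status (read with open(...).read()) to effective_caps should give the same list that pscap reports for that PID.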

request additional capability

runc exec can request additional capabilities that are not listed in config.json.
Run another container, xyxy78, without CAP_SYS_ADMIN in config.json.

Double-check that it really doesn't have the capability.
$ sudo runc ps xyxy78
UID   PID   PPID  C STIME TTY   TIME     CMD
root  27385 27376 0 11:57 pts/0 00:00:00 /bin/sh
$ pscap | grep 27385
27376 27385 root sh kill, net_bind_service, audit_write
Start another process in xyxy78, this time with the additional CAP_SYS_ADMIN capability, using the --cap option.
sudo runc exec --cap CAP_SYS_ADMIN xyxy78 /bin/hostname cool
Under the hood, the --cap option sets up the capability lists for the process that will be exec-ed, just as config.json sets them up for the init process.


You can use capsh to explore a little more. Run capsh --print inside the container.

This is the output with the default config.json:
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip
Bounding set =cap_kill,cap_net_bind_service,cap_audit_write
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
This is the output with the added CAP_SYS_ADMIN capability. Compared with the former, we can see an additional cap_sys_admin+ep in "Current" and cap_sys_admin in "Bounding set". The "+ep" means the preceding capabilities are in both the "effective" and "permitted" lists ("+eip" adds the "inheritable" list as well). For more information about the capability lists, see the capabilities(7) man page.
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep
Bounding set =cap_kill,cap_net_bind_service,cap_sys_admin,cap_audit_write
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
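To make the "+ep" and "+eip" notation concrete, here is a rough sketch that splits the text printed after "Current:" into effective, inheritable, and permitted sets. The function name parse_cap_text is hypothetical, and the parser is deliberately simplified: it only understands explicitly named capabilities, not the "all" shorthand that the full text format allows.

```python
import re

FLAGS = {"e": "effective", "i": "inheritable", "p": "permitted"}


def parse_cap_text(text):
    """Parse a capsh-style capability string into per-set collections of names.

    Simplified sketch: a bare "=" clause clears every set, but flags applied to
    an empty name list (meaning "all capabilities") are rejected.
    """
    sets = {name: set() for name in FLAGS.values()}
    for clause in text.split():
        match = re.fullmatch(r"([a-z0-9_,]*)((?:[=+-][eip]*)+)", clause)
        if match is None:
            raise ValueError(f"cannot parse clause: {clause!r}")
        names = [n for n in match.group(1).split(",") if n]
        for op, flags in re.findall(r"([=+-])([eip]*)", match.group(2)):
            if not names and flags:
                raise ValueError("the 'all' form is not supported in this sketch")
            if op == "=":
                # "=" first resets the listed capabilities (or everything) in all sets.
                for members in sets.values():
                    if names:
                        members.difference_update(names)
                    else:
                        members.clear()
            for flag in flags:
                if op == "-":
                    sets[FLAGS[flag]].difference_update(names)
                else:
                    sets[FLAGS[flag]].update(names)
    return sets


# The "Current:" value from the container with CAP_SYS_ADMIN added.
current = "= cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep"
result = parse_cap_text(current)
for set_name in ("effective", "inheritable", "permitted"):
    print(set_name, sorted(result[set_name]))
```

Running it on the string above puts cap_sys_admin in the effective and permitted sets but not in the inheritable one, which is exactly what the trailing "+ep" (as opposed to "+eip") says.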


We have seen how Linux capabilities limit what a process can do and thus increase the security of the container.
