
Understand Container 3: Linux Capabilities

Linux capabilities break the super privileges enjoyed by the root user into fine-grained rights (well, just to avoid saying capabilities), so that even as root you are not able to do whatever you want unless you have been granted the corresponding capabilities.

prepare a rootfs

We need an additional tool (libcap) to explore capabilities, so here are some instructions for preparing a suitable rootfs.
First, create a Docker container with libcap installed:
sudo docker run -it alpine sh -c 'apk add -U libcap; capsh --print'
Use docker ps -a to find the ID of the container we just ran; it should be the latest one.
Then export its rootfs to create a runc runtime bundle:
mkdir rootfs
docker export $container_id | tar -C rootfs -xvf -
runc spec


With the default config.json generated by runc spec, you are not allowed to set the hostname, even as root.
$ sudo runc run xyxy67
/ # id
uid=0(root) gid=0(root)
/ # hostname cool
hostname: sethostname: Operation not permitted
That's because setting the hostname requires the CAP_SYS_ADMIN capability, even for root. We can grant it by adding CAP_SYS_ADMIN to the bounding, permitted, and effective lists of the capabilities attribute of the init process in config.json.
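As a rough sketch of that edit, the Python snippet below adds a capability to the three lists of a spec shaped like the one `runc spec` generates. The helper name `add_capability` and the trimmed-down in-memory spec are made up for illustration; a real config.json contains many more fields.

```python
import json

def add_capability(spec, cap, sets=("bounding", "permitted", "effective")):
    """Add `cap` to the given capability lists of the init process (hypothetical helper)."""
    caps = spec["process"].setdefault("capabilities", {})
    for name in sets:
        current = caps.setdefault(name, [])
        if cap not in current:  # avoid duplicates when run twice
            current.append(cap)
    return spec

# A trimmed-down stand-in for the config.json produced by `runc spec`.
default_caps = ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
spec = {"process": {"capabilities": {
    "bounding": list(default_caps),
    "effective": list(default_caps),
    "permitted": list(default_caps),
}}}

add_capability(spec, "CAP_SYS_ADMIN")
print(json.dumps(spec["process"]["capabilities"], indent=2))
```

To apply this to a real bundle, you would load config.json with `json.load`, call the helper, and write the result back with `json.dump`. Editing the file by hand works just as well.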

Run another container with the new configuration, and now you are allowed to set the hostname.
$ sudo runc run xyxy67
/ # hostname
runc
/ # hostname hello
/ # hostname
hello
/ #
Run another command in the same container, and it will be able to set the hostname as well, since it inherits the capabilities of the init process.
$ sudo runc exec -t xyxy67 /bin/sh
[sudo] password for binchen:
/ # hostname
hello
/ # hostname good
/ # hostname
good

get the capability

Get the PIDs of the two processes as seen from the PID namespace the runtime runs in (that is, the host).
$ sudo runc ps xyxy67
UID   PID    PPID   C  STIME  TTY    TIME      CMD
root  26002  25993  0  11:42  pts/0  00:00:00  /bin/sh
root  26059  26051  0  11:43  pts/1  00:00:00  /bin/sh
Install pscap on the host:
sudo apt-get install libcap-ng-utils
Check the capabilities of the running processes, using their PIDs in the host namespace.
$ pscap | grep "26059\|26002"
25993 26002 root sh kill, net_bind_service, sys_admin, audit_write
26051 26059 root sh kill, net_bind_service, sys_admin, audit_write
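The kernel also exposes the same information as hexadecimal bitmasks (CapInh, CapPrm, CapEff, CapBnd) in /proc/<pid>/status. A minimal sketch of decoding such a mask follows. The bit numbers come from linux/capability.h, only the four capabilities seen in this post are listed, and the function names and the sample status text are made up for illustration.

```python
# Bit numbers from <linux/capability.h>; deliberately partial.
CAP_BITS = {
    5: "kill",
    10: "net_bind_service",
    21: "sys_admin",
    29: "audit_write",
}

def cap_eff_from_status(status_text):
    """Return the effective capability mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")

def decode_caps(mask):
    """Translate a capability bitmask into names, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits missing from the partial table are shown by number.
            names.append(CAP_BITS.get(bit, "cap#%d" % bit))
        bit += 1
    return names

# Sample text constructed for illustration: bits 5, 10, 21, and 29 set.
sample = "Name:\tsh\nCapEff:\t0000000020200420\n"
print(", ".join(decode_caps(cap_eff_from_status(sample))))
# prints: kill, net_bind_service, sys_admin, audit_write
```

On the host, the real input would be the contents of /proc/26002/status, for example.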

request additional capability

runc exec can request additional capabilities that are not listed in config.json.
Run another container, xyxy78, without CAP_SYS_ADMIN in config.json.

Double-check that it really doesn't have the capability.
$ sudo runc ps xyxy78
UID   PID    PPID   C  STIME  TTY    TIME      CMD
root  27385  27376  0  11:57  pts/0  00:00:00  /bin/sh
$ pscap | grep 27385
27376 27385 root sh kill, net_bind_service, audit_write
Start another process in xyxy78, this time with the additional CAP_SYS_ADMIN capability, using the --cap option.
sudo runc exec --cap CAP_SYS_ADMIN xyxy78 /bin/hostname cool
Under the hood, the --cap option sets up the capability lists for the process that will be exec-ed, just as config.json sets them up for the init process.


You can use capsh to explore a little more. Run capsh --print inside the container.

This is the output with default config.json:
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip
Bounding set =cap_kill,cap_net_bind_service,cap_audit_write
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
This is the output with the added CAP_SYS_ADMIN capability. Compared with the former, there is an additional cap_sys_admin+ep in "Current" and cap_sys_admin in "Bounding set". The "+ep" means the preceding capabilities are in both the "effective" and "permitted" sets. For more information on the capability sets, see the capabilities(7) man page.
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep
Bounding set =cap_kill,cap_net_bind_service,cap_sys_admin,cap_audit_write
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
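To make the flag suffixes on the "Current" line easier to read, here is a small sketch that maps each capability to its flags (e for effective, i for inheritable, p for permitted). It is a simplification that only understands the "+" clauses seen in the output above, not the full textual format, and `parse_cap_text` is a hypothetical name.

```python
def parse_cap_text(text):
    """Map each capability to its set of flags; handles only '+' clauses."""
    flags_by_cap = {}
    for clause in text.split():
        if "+" not in clause:
            continue  # skip the leading "=", which starts from empty sets
        names, flags = clause.split("+", 1)
        for name in names.split(","):
            flags_by_cap.setdefault(name, set()).update(flags)
    return flags_by_cap

# The "Current" value copied from the second capsh output above.
current = "= cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep"
for cap, flags in parse_cap_text(current).items():
    print(cap, "".join(sorted(flags)))
```

In the parsed result cap_sys_admin has no "i" flag, which matches the fact that it was added only to the bounding, permitted, and effective lists.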


We have seen how Linux capabilities are used to limit what a process can do, and thus increase the security of the container.
