
Understand Container 3: Linux Capabilities

Linux capabilities are used to break the super privileges enjoyed by the root user into fine-grained rights (well, just to avoid saying capabilities), so that even as root you cannot do whatever you want unless you have been granted the corresponding capabilities.

prepare a rootfs

We'll need an additional tool (libcap) to explore the capabilities, so here are some instructions on how to prepare such a rootfs.
First, create a docker container with libcap installed:
sudo docker run -it alpine sh -c 'apk add -U libcap; capsh --print'
Use docker ps -a to find the container id of the one we just ran; it should be the latest one.
Then export the rootfs to create a runc runtime bundle.
mkdir rootfs
docker export $container_id | tar -C rootfs -xvf -
runc spec


With the default config.json generated by runc spec, you are not allowed to set the hostname, even as root.
$ sudo runc run xyxy67
/ # id
uid=0(root) gid=0(root)
/ # hostname cool
hostname: sethostname: Operation not permitted
That's because setting the hostname requires the CAP_SYS_ADMIN capability, even for root. We can grant it by adding CAP_SYS_ADMIN to the bounding, permitted, and effective lists in the capabilities attribute of the init process (process.capabilities in config.json).
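The edit described above can be sketched in Python. This is only an illustration: add_capability is a hypothetical helper, and default_config is a trimmed stand-in for the real config.json (the capability lists are assumed from the capsh and pscap output shown in this post, not copied from an actual runc spec run).

```python
import json

# The three sets this post edits; "inheritable" is deliberately left alone.
CAP_SETS = ("bounding", "permitted", "effective")


def add_capability(config, cap, sets=CAP_SETS):
    """Append `cap` to the given process capability sets, skipping duplicates."""
    caps = config.setdefault("process", {}).setdefault("capabilities", {})
    for name in sets:
        current = caps.setdefault(name, [])
        if cap not in current:
            current.append(cap)
    return config


# Assumed, trimmed stand-in for the "process" part of a default config.json.
defaults = ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
default_config = {
    "process": {
        "capabilities": {
            "bounding": list(defaults),
            "effective": list(defaults),
            "inheritable": list(defaults),
            "permitted": list(defaults),
        }
    }
}

updated = add_capability(default_config, "CAP_SYS_ADMIN")
print(json.dumps(updated["process"]["capabilities"], indent=2))
```

To apply it to a real bundle, you would load config.json with json.load, call the helper, and write the result back with json.dump before running the container.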

Run another container with the new configuration, and now you are allowed to set the hostname.
$ sudo runc run xyxy67
/ # hostname
runc
/ # hostname hello
/ # hostname
hello
/ #
Run another command in the same container, and it will be able to set the hostname as well, since it gets the same capabilities as the init process.
$ sudo runc exec -t xyxy67 /bin/sh
[sudo] password for binchen:
/ # hostname
hello
/ # hostname good
/ # hostname
good

get the capability

Get the PIDs of the two processes as seen from the host (the PID namespace runc itself runs in).
$ sudo runc ps xyxy67
UID PID PPID C STIME TTY TIME CMD
root 26002 25993 0 11:42 pts/0 00:00:00 /bin/sh
root 26059 26051 0 11:43 pts/1 00:00:00 /bin/sh
Install pscap on the host:
sudo apt-get install libcap-ng-utils
Check the capabilities of the running processes using their PIDs in the host namespace.
$ pscap | grep "26059\|26002"
25993 26002 root sh kill, net_bind_service, sys_admin, audit_write
26051 26059 root sh kill, net_bind_service, sys_admin, audit_write
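The kernel also exposes the same information as hexadecimal bitmasks in /proc/<pid>/status (the CapInh, CapPrm, CapEff and CapBnd lines). Here is a minimal sketch of decoding such a mask. The names decode_caps and effective_caps are made up for this example, and the table only covers the four capabilities that appear in this post; the bit numbers are the ones defined in linux/capability.h.

```python
# Bit positions from <linux/capability.h>; only the capabilities
# that show up in this post are named here.
CAP_BITS = {
    5: "kill",
    10: "net_bind_service",
    21: "sys_admin",
    29: "audit_write",
}


def decode_caps(mask_hex):
    """Turn a hex capability mask into a list of names, lowest bit first."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Capabilities not in the small table are reported by bit number.
            names.append(CAP_BITS.get(bit, "cap_%d" % bit))
        bit += 1
    return names


def effective_caps(pid):
    """Read and decode the CapEff line of /proc/<pid>/status (Linux only)."""
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []


# Mask with bits 5, 10, 21 and 29 set, i.e. the four capabilities listed above.
print(decode_caps("0000000020200420"))
```

On the host, effective_caps(26002) would read the mask of the first container process from the listing above, assuming that process is still running.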

request additional capability

runc exec can request additional capabilities that are not in config.json.
Run another container, xyxy78, without CAP_SYS_ADMIN in config.json.

Double check that it really doesn't have the capability.
$ sudo runc ps xyxy78
UID PID PPID C STIME TTY TIME CMD
root 27385 27376 0 11:57 pts/0 00:00:00 /bin/sh
$ pscap | grep 27385
27376 27385 root sh kill, net_bind_service, audit_write
Start another process in xyxy78, but with the additional CAP_SYS_ADMIN capability, using the --cap option.
sudo runc exec --cap CAP_SYS_ADMIN xyxy78 /bin/hostname cool
Under the hood, the --cap option sets up the capability lists for the process that will be exec-ed, just as config.json does for the init process.


You can use capsh to explore a little more. Run capsh --print inside the container.

This is the output with the default config.json:
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip
Bounding set =cap_kill,cap_net_bind_service,cap_audit_write
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
This is the output with the added CAP_SYS_ADMIN capability. Compared with the former one, we can see an additional cap_sys_admin+ep in "Current" and cap_sys_admin in the "Bounding set". The "+ep" means the preceding capabilities are in both the "effective" and "permitted" sets; the default capabilities carry "+eip" because they are in the "inheritable" set as well. For more information regarding the capability sets, see the capabilities(7) man page.
# capsh --print
Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep
Bounding set =cap_kill,cap_net_bind_service,cap_sys_admin,cap_audit_write
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=
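To make the "+eip" and "+ep" suffixes concrete, here is a small sketch that splits the "Current" text into the effective, inheritable and permitted sets. parse_cap_text is a hypothetical helper written for this post; it only understands the "names+flags" clauses shown above, not the full textual format that libcap accepts.

```python
def parse_cap_text(text):
    """Parse clauses like 'cap_a,cap_b+eip cap_c+ep' into per-set name sets."""
    flag_names = {"e": "effective", "i": "inheritable", "p": "permitted"}
    sets = {"effective": set(), "inheritable": set(), "permitted": set()}
    for clause in text.split():
        if "+" not in clause:
            # e.g. the leading "=", which starts from empty sets
            continue
        names, flags = clause.split("+", 1)
        for flag in flags:
            if flag in flag_names:
                sets[flag_names[flag]].update(names.split(","))
    return sets


# The "Current:" value from the second capsh output above.
current = "= cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep"
caps = parse_cap_text(current)
for set_name in ("effective", "inheritable", "permitted"):
    print(set_name, sorted(caps[set_name]))
```

Running it shows cap_sys_admin in the effective and permitted sets but not in the inheritable one, which matches what we added to config.json.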


We have seen how Linux capabilities are used to limit what a process can do, and thus increase the security of the container.

