Linux Security: seccomp, and its usage in Android and Docker


seccomp is short for SECure COMPuting. The name sounds broad, but its scope is actually narrow, and effective. Simply put, it is a default-deny white-list firewall that the kernel uses to restrict which syscalls a process can make.
seccomp is widely used in many popular systems to sandbox processes and/or reduce the kernel attack surface, notably Chromium, Android and Docker.

How it works

As mentioned above, seccomp is fundamentally a white-list: every time a process makes a system call, the kernel checks the list to decide whether that process is allowed to make that particular call.
Technically, the white-list is written as Berkeley Packet Filter (BPF) rules, which are then passed to the kernel through the seccomp system call.
Writing rules in BPF isn't intuitive for most programmers, so there are various wrappers that make it more user friendly. Android uses minijail, which actually comes from the Chromium OS project. Docker has a Go wrapper, where you can write the profile in JSON format.
We'll see how they are used in practice.
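
To make the BPF step concrete, here is a minimal sketch of what such a white-list can look like before any wrapper is involved. It is an illustration under stated assumptions, not how minijail or Docker build their filters: the helper names (Stmt, Jump, BuildAllowList, InstallFilter) are hypothetical, it is Linux only, and it skips the architecture check on seccomp_data.arch that a production filter should perform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#include <linux/filter.h>   // sock_filter, sock_fprog, BPF_* opcodes
#include <linux/seccomp.h>  // seccomp_data, SECCOMP_RET_*, SECCOMP_SET_MODE_FILTER
#include <sys/prctl.h>      // prctl, PR_SET_NO_NEW_PRIVS
#include <sys/syscall.h>    // SYS_seccomp and the other syscall numbers
#include <unistd.h>         // syscall

// Hypothetical helper: one BPF instruction that is not a jump.
static sock_filter Stmt(unsigned short code, unsigned int k) {
    return sock_filter{code, 0, 0, k};
}

// Hypothetical helper: a conditional jump that skips `jt` instructions
// when the comparison is true and `jf` instructions when it is false.
static sock_filter Jump(unsigned short code, unsigned int k,
                        unsigned char jt, unsigned char jf) {
    return sock_filter{code, jt, jf, k};
}

// Builds a classic-BPF program that allows only the listed syscall numbers
// and kills the caller on anything else (default deny).
// Assumption: no architecture check, for brevity.
std::vector<sock_filter> BuildAllowList(const std::vector<unsigned int>& allowed) {
    std::vector<sock_filter> prog;
    // Load the syscall number from struct seccomp_data into the accumulator.
    prog.push_back(Stmt(BPF_LD | BPF_W | BPF_ABS, offsetof(seccomp_data, nr)));
    for (unsigned int nr : allowed) {
        // On a match fall through to ALLOW, otherwise skip over it.
        prog.push_back(Jump(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1));
        prog.push_back(Stmt(BPF_RET | BPF_K, SECCOMP_RET_ALLOW));
    }
    // Nothing matched: deny.
    prog.push_back(Stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL));
    assert(prog.size() <= BPF_MAXINSNS);
    return prog;
}

// Hands the program to the kernel through the seccomp system call.
// Returns true on success. After this call the filter cannot be removed.
bool InstallFilter(std::vector<sock_filter>& prog) {
    sock_fprog fprog{};
    fprog.len = static_cast<unsigned short>(prog.size());
    fprog.filter = prog.data();
    // Unprivileged processes must set no_new_privs before installing a filter.
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) return false;
    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog) == 0;
}
```

Note that a process which installs a list that is too short (for example one without exit_group) will be killed as soon as it makes a syscall that is not listed, which is exactly why the wrappers described next are preferred in practice.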

seccomp in Android

Android defines a seccomp policy per process or service. minijail is the helper library that parses the policy file and passes it to the kernel.
Below we'll see in detail how seccomp is used for the mediaextractor service. Let's jump directly to the code:
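// mediaextractor/minijail/minijail.cpp
#include <libminijail.h>

static const char kSeccompFilePath[] =
    "/system/etc/seccomp_policy/mediaextractor-seccomp.policy";

int MiniJail()
{
    struct minijail *jail = minijail_new();
    if (jail == nullptr) {
        return -1;  // could not allocate the jail
    }
    minijail_no_new_privs(jail);                 // needed before a filter can be installed
    minijail_log_seccomp_filter_failures(jail);  // log syscalls that the filter blocks
    minijail_use_seccomp_filter(jail);           // turn on seccomp filtering
    minijail_parse_seccomp_filters(jail, kSeccompFilePath);  // policy file -> BPF filter
    minijail_enter(jail);                        // apply the restrictions to this process
    minijail_destroy(jail);
    return 0;
}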
It is quite straightforward, thanks to the self-explanatory function names and the apt analogy (minijail).
We first create a minijail, parse the policy (converting it into a BPF filter), and finally enter the jail (which installs the filter in the kernel).
A peek at the format and content of mediaextractor-seccomp.policy makes things clearer: it lists all the syscalls that the target process is allowed to make.
ioctl: 1
futex: 1
prctl: 1
write: 1
getpriority: 1
mmap2: 1
close: 1
munmap: 1
dup: 1
mprotect: 1
getuid32: 1
setpriority: 1
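
To illustrate how simple this format is, the following sketch reads a policy of the kind listed above and collects the names of the unconditionally allowed syscalls. It is not minijail's parser: ParseAllowedSyscalls and Trim are hypothetical names, and the sketch only understands "name: 1" lines and "#" comments, while real minijail policies can also carry argument filters and other rules, which are simply ignored here.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: strips leading and trailing whitespace.
static std::string Trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) return "";
    const std::size_t end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

// Returns the syscall names whose rule is exactly "1" (always allowed).
// Any syscall that does not end up in the set is denied by default.
std::set<std::string> ParseAllowedSyscalls(std::istream& policy) {
    std::set<std::string> allowed;
    std::string line;
    while (std::getline(policy, line)) {
        line = Trim(line);
        if (line.empty() || line[0] == '#') continue;  // blank line or comment
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;      // not a "name: rule" line
        const std::string name = Trim(line.substr(0, colon));
        const std::string rule = Trim(line.substr(colon + 1));
        // Assumption: anything other than "1" (e.g. an argument filter) is skipped.
        if (!name.empty() && rule == "1") allowed.insert(name);
    }
    return allowed;
}
```

Feeding the twelve lines above through ParseAllowedSyscalls (for example via a std::istringstream) would yield a set of twelve names; a syscall such as open, which is not in that excerpt, would not be in the set.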

seccomp in Docker

seccomp profile support was introduced in Docker 1.10. A seccomp profile can be specified with the --security-opt seccomp=.json parameter of docker run or docker create.
docker run -it --rm --security-opt seccomp=.json alpine sh ...
If no seccomp profile is specified, a default profile is used. The default profile disables 40+ system calls out of 300+ to provide moderate protection. The seccomp profile is in JSON format; the Docker daemon converts it to a BPF filter and applies it to the created process/container.
Applications packaged in a Docker container can only make the system calls allowed by the seccomp profile you specify, giving you more control over the container's security.
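
For illustration, a minimal white-list profile might look like the fragment below. The field names follow the profile format used by recent Docker versions; the syscall list itself is a hypothetical, deliberately tiny example and is far too short to start a real container.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Here defaultAction is the default-deny part (any unlisted syscall fails with an error), and each entry under syscalls whitelists a group of names. In practice it is usually easier to start from Docker's default profile and remove what you do not need than to build a list from scratch.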

Summary

In this article, we discussed what seccomp is and how Android and Docker use it to build more secure systems.