Files in a Linux system are organized as a tree. The tree normally starts with a root file system (called rootfs) provided by the Linux distribution, and the rootfs is mounted at "/". Later, additional file systems can optionally be attached to a subdirectory, such as /data, which might for example point to an external USB disk.
mount(2) is the system call used to attach a file system or a directory to a node of the tree. When the system boots up, the init process makes multiple mount calls to set up the file system properly, and that produces the initial mount table. Every process has its own mount table, but they normally all point to the same one, the one set up by the init process. However, a process can also have a mount table separate from its parent's: it starts as a copy of the parent's, but any later change to it (made via mount) affects only that process. That is what a mount namespace means and is for. Worth noting: within the same mount namespace, any change to the mount table by one process is visible to the other processes. That is why, when you mount a USB disk from the shell, the file explorer can see its contents as well.
Normally, an application won't create a separate mount namespace when it starts.
For example, there are two mnt namespaces on my host. The first one has only one process, kdevtmpfs, which is a kernel process. All the other processes are in the second mount namespace, created by /sbin/init. And if you check the mount points of any two processes in that mnt namespace (with cat /proc/<pid>/mounts), they are all the same.(*)
* The exceptions are the chrome processes. The mount points for chrome show up empty despite being in the same namespace. Honestly, I'm not sure why.
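In case you want to reproduce this by hand, here is a minimal sketch that enumerates the mount namespaces directly from /proc, without relying on any particular tool:

```shell
# Each process exposes its mount namespace as a symlink like mnt:[4026531840];
# the number in brackets identifies the namespace. Collecting the unique
# targets across /proc shows how many mnt namespaces are in use.
# (Links of processes owned by other users need root to read; errors are dropped.)
for pid in /proc/[0-9]*; do
    readlink "$pid/ns/mnt" 2>/dev/null
done | sort -u

# And the mount table as seen by the current process:
head -3 /proc/self/mounts
```

Run as root you should see both namespaces; unprivileged you will at least see your own.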
A container creates an extra mnt namespace
Let's start a container and see what changes in the mnt namespaces.
Check the mount namespace
We have a new mount namespace, 4026532458, which was created when running container xyxy12:
And here is the dump of the mount info for our new container. Probably the content doesn't interest you too much; we will skip the details of most of those entries here, except one:
This mount source points to /dev/sda2, which is the device our host's rootfs is mounted from.
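If you only want that one entry, you can filter /proc/self/mountinfo yourself; field 5 of each line is the mount point, so this prints just the entry for "/" (run it inside the container to see what its root is mounted from, e.g. /dev/sda2 in this setup):

```shell
# Print only the mount table entry whose mount point (field 5) is "/".
# The filesystem type and source device appear after the "-" separator.
awk '$5 == "/"' /proc/self/mountinfo
```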
Does that sound surprising and alarming to you? Why is the root of the container the same as the root of the host? So with a new mount namespace, can we still access the root of the host? Shouldn't the container be "jailed" in the rootfs it was started in?
Let's check one more thing: compare the inode number of / in the container with the inode of the container's rootfs.
They are the same! That means the root of the container really is the rootfs (the directory of its runtime bundle, in OCI's terms), as you expected!
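The comparison itself is just two stat calls; the bundle path below is a stand-in for wherever your runtime bundle actually lives:

```shell
# Inside the container: the inode number of its root directory.
stat -c 'inode of /: %i' /

# On the host, against the bundle's rootfs directory
# (path is an example; substitute your own bundle location):
# stat -c 'inode of bundle rootfs: %i' ./xyxy12/rootfs
```

If the two numbers match, they are the same directory on the same filesystem.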
So, why? From the mount info we see "/" is mounted from /dev/sda2, the same device as the host's root, yet the actual root is the container's bundle directory?
The "jail" is done by pivot_root(2), which basically changes the root of the process to the runtime bundle directory.
This is the code that does that magic. The latest version looks less easy to understand than the earlier one, since it adopts an idea from lxc to make pivot_root work on a read-only rootfs, so there is no need to create a temporary writable directory.
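To see the mechanism without reading runc, here is a small unprivileged sketch of the pivot_root dance using a user plus mount namespace. This illustrates the sequence, it is not runc's actual code, and it requires a kernel that allows unprivileged user namespaces:

```shell
# Build a throwaway "bundle rootfs", then pivot into it.
rootfs=$(mktemp -d)
mkdir -p "$rootfs/old"
touch "$rootfs/inside-the-jail"

unshare --user --map-root-user --mount sh -c '
    mount --rbind "$1" "$1"   # pivot_root wants the new root to be a mount point
    cd "$1"
    pivot_root . old          # "/" is now our mini rootfs; the old host root sits at /old
    umount -l /old 2>/dev/null || true   # detach the old host root (runc does this via
                                         # the umount2 syscall; the CLI tool may need /proc)
    echo /*                   # only the bundle contents are visible from "/"
' sh "$rootfs"
```

After the pivot, the shell can only see the files we created in the temp directory, which is exactly the "jail" the container runtime builds.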
It wouldn't be complete if we didn't mention chroot(2) when talking about the filesystem for containers. Actually, it is not mandatory to create a new mnt namespace and use pivot_root. Optionally, but less ideally, you can use chroot(2), which will "jail" the calling process (and all its children) into the rootfs the container starts with. Unlike the mount namespace,
chroot won't change anything in the mount table; it just changes the process's path lookup, interpreting / as the path chroot-ed to. For the differences between chroot and pivot_root, see here. In short, pivot_root is more thorough, and safer.
To use chroot, if you like, make the following changes to your default config.json. In addition to removing the mount namespace entry (type: mount), we have removed "maskedPaths" and "readonlyPaths", which require a private mount namespace to work.
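For illustration, the linux.namespaces array might end up looking like this after dropping the mount entry (the exact list depends on your default config.json; "maskedPaths" and "readonlyPaths" are simply deleted from the linux section):

```json
{
    "linux": {
        "namespaces": [
            { "type": "pid" },
            { "type": "ipc" },
            { "type": "uts" },
            { "type": "network" }
        ]
    }
}
```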
If we re-do the exercise from before, you will find that no new namespace is created this time, but you still can't list files outside of the rootfs.
Let me throw in some code to make things clearer. Ignore NoPivotRoot for the moment and assume it is false.
bind mount is a type of mount supported by Linux that remounts part of the file hierarchy somewhere else, so it can be accessed from both places. It is used to share a host directory with the container.
Make the following change to config.json. It tells the container runtime to bind mount a local directory (host_dir, a relative path under the runtime bundle) to a directory in the container (/host_dir, an absolute path in the container rootfs).
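As a sketch, the OCI runtime spec expresses this as an entry in the mounts array (paths here match this article's example):

```json
{
    "mounts": [
        {
            "destination": "/host_dir",
            "type": "bind",
            "source": "host_dir",
            "options": [ "rbind", "rw" ]
        }
    ]
}
```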
For a bind mount to work, the host directory must exist before mounting; the bind destination directory need not, as it will be created by the container runtime if it doesn't exist. Here is (part of) the directory tree:
Start the container with the new config.json, and we can see the content of /host_dir in the container. However, since the bind mount happens in the container's mount namespace, not on the host, you won't see this mount from the host.
We can double-check this by looking at the mount info inside the container:
Access host USB
Let's do one more exercise: accessing a host USB disk. Why? Because a USB disk is a volume device, which aligns with today's topic of data and filesystems in containers! In addition, as we'll see, bind mount can be used not only for mounting a host directory into the container, but also a host device file.
Make the following changes to the default config.json; we'll explain them shortly.
Start the container with the new config and we will be able to read and write the USB disk from the /usb2 directory inside the container.
Here is an explanation of the changes we made:
(1) bind mount the device node from the host (/dev/sdb1) to the container (/dev/usb);
(2) mount the device (/dev/usb) to a directory inside the container (/usb2);
(3) set the uid/gid of the bash process to match the uid/gid of the usb2 directory. Without this change, we would hit permission issues. We'll have another article on users and permissions in containers; for now, just set the uid:gid as we have done here.
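Putting the three changes together, the relevant config.json fragments might look like this (the device path, filesystem type, and uid/gid values are examples from this setup; adjust them to yours):

```json
{
    "process": {
        "user": { "uid": 1000, "gid": 1000 }
    },
    "mounts": [
        {
            "destination": "/dev/usb",
            "type": "bind",
            "source": "/dev/sdb1",
            "options": [ "bind" ]
        },
        {
            "destination": "/usb2",
            "type": "vfat",
            "source": "/dev/usb",
            "options": [ "rw" ]
        }
    ]
}
```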
Lastly, a few words on volumes, which are Docker terminology and are not covered by the OCI runtime spec. Fundamentally, a volume is still a mount, be it a bind mount of a directory (as in our host_dir case) or a real mount of a volume device (as in our USB case). We can think of a volume as a "managed mount service from Docker" with a handy CLI interface.
We talked about how a container creates a new mount namespace and jails its processes inside the container rootfs, and then about how containers use mount and bind mount to access and share host devices and directories. We also skimmed Docker's concept of a volume, which is fundamentally a "managed mount".