by Malik Algelly and Hugo Haldi
Linux namespaces
Docker networks
Focus on User Namespaces
Capabilities
# terminal 1: initial UTS namespace
$ hostname
ms-7917
# waiting for the new UTS
# namespace to change hostname
$ hostname
ms-7917
# change the hostname in the
# initial UTS namespace
$ hostname yggdrasil
# terminal 2: new UTS namespace
$ unshare -u
$ hostname
ms-7917
# change the hostname in
# the new UTS namespace
$ hostname thor
$ hostname
thor
# wait for the initial UTS
# namespace to change hostname
$ hostname
thor
Schema from this conference by Michael Kerrisk
$ sudo unshare -p -f
$ echo $$
1
$ ps
PID TTY TIME CMD
39352 pts/3 00:00:00 sudo
39353 pts/3 00:00:00 unshare
39354 pts/3 00:00:02 bash
49657 pts/3 00:00:00 ps
$ sudo unshare -p -f -m
$ mount -t proc none /proc
$ echo $$
1
$ ps
PID TTY TIME CMD
1 pts/3 00:00:00 bash
32 pts/3 00:00:00 ps
ip netns add loki
ip link add eth0-l type veth peer name veth-l
ip link set eth0-l netns loki
ip link set veth-l up
ip address add 10.0.0.1/24 dev veth-l
ip netns exec loki ip link set lo up
ip netns exec loki ip link set eth0-l up
ip netns exec loki ip address add 10.0.0.2/24 dev eth0-l
bridge
host
overlay
ipvlan
macvlan
none
third parties


From the Docker documentation
Similar to IPvlan, but assigns a MAC address to each container, making containers visible as real devices on the network.
Useful for applications that analyse network traffic.
None, Overlay, third-party drivers, …
A process with CAP_NET_ADMIN can only modify network interfaces that are in a network namespace owned by the process's user namespace.
Schema from this conference by Michael Kerrisk
Goal?
If a program that has one or more capabilities is compromised, it has less opportunity to do damage than a root process.
Rather than giving privileges to non-privileged processes,
capabilities allow you to remove privileges from the all-powerful root.
Effective: used by the kernel for every privilege check
Permitted: limits what the thread can add to its effective set with the capset system call
Inheritable: preserved across an execve
Bounding: limits the capabilities that can be gained during an execve
Ambient: preserved across an execve of a program that is not privileged
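As a rough illustration of these five sets, the sketch below decodes the CapInh, CapPrm, CapEff, CapBnd and CapAmb bitmasks that Linux exposes in /proc/[pid]/status. The helper names (parse_cap_sets, named_caps) are made up for this example, and the CAP_BITS table is deliberately limited to the three capabilities mentioned in these slides.

```python
# Bit positions as defined in <linux/capability.h>; only a small subset.
CAP_BITS = {12: "CAP_NET_ADMIN", 16: "CAP_SYS_MODULE", 31: "CAP_SETFCAP"}

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_sets(status_text):
    """Return {field: bitmask} for the capability lines of a status file."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            sets[key] = int(value.strip(), 16)  # masks are printed in hex
    return sets


def named_caps(mask):
    """List the known capability names whose bit is set in mask."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]
```

On a Linux host, passing the contents of /proc/self/status to parse_cap_sets and the "CapEff" value to named_caps shows which of these capabilities the current process can actually use.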
The kernel supports associating capability sets with an executable file, similar to setuid.
Each file has 3 sets of capabilities:
Effective (in practice a single bit), Permitted, Inheritable
File capability sets are stored in an extended attribute named security.capability.
Writing to this attribute needs the CAP_SETFCAP capability.
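Reading the attribute, unlike writing it, needs no special capability. A minimal sketch, assuming Linux and Python's os.getxattr; the function name raw_file_capabilities is hypothetical, and the returned bytes are left undecoded:

```python
import errno
import os


def raw_file_capabilities(path):
    """Return the raw security.capability xattr of path, or None if absent."""
    if not hasattr(os, "getxattr"):
        return None  # extended attribute API not available on this platform
    try:
        return os.getxattr(path, "security.capability")
    except OSError as exc:
        # ENODATA: the file has no file capabilities
        # ENOTSUP: the filesystem does not support extended attributes
        if exc.errno in (errno.ENODATA, errno.ENOTSUP):
            return None
        raise
```

A freshly created file has no such attribute, so the function returns None for it.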
File capability sets, in combination with the thread capability sets, will determine the capabilities of a thread after an execve.
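The combination can be sketched with the transformation rules given in the capabilities(7) manual page. This is a simplified model, not the kernel's code: it ignores the special handling of root and of set-user-ID programs, and treats a file as "privileged" only when it carries file capabilities. The class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadCaps:
    permitted: frozenset
    inheritable: frozenset
    effective: frozenset
    bounding: frozenset
    ambient: frozenset


@dataclass(frozen=True)
class FileCaps:
    permitted: frozenset = frozenset()
    inheritable: frozenset = frozenset()
    effective: bool = False  # a single bit, not a set


def caps_after_execve(p, f):
    """Thread capability sets after execve of a file with capabilities f."""
    # Simplifying assumption: privileged means "has file capabilities".
    privileged = bool(f.permitted or f.inheritable or f.effective)
    ambient = frozenset() if privileged else p.ambient
    permitted = (
        (p.inheritable & f.inheritable)
        | (f.permitted & p.bounding)
        | ambient
    )
    effective = permitted if f.effective else ambient
    # Inheritable and bounding sets are carried over unchanged.
    return ThreadCaps(permitted, p.inheritable, effective, p.bounding, ambient)
```

For example, a thread with no capabilities that executes a file granting CAP_NET_ADMIN in its permitted set only gains that capability if CAP_NET_ADMIN is still in the thread's bounding set.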

Docker imposes some limitations with file capabilities.
Extended attributes are removed when Docker images are built.
“This means you will not normally have to concern yourself too much with file capabilities in containers.”
dockerlabs
Exploit procfs and runC.
• /proc/self/exe is a symlink to the executable of the current process
• /proc/self/fd is a directory containing the file descriptors opened by the process
• The target of the attack is the host's runc binary
• The attack is triggered by running a command (docker exec) in an existing container to which the attacker previously had write access
• To run that command, runc starts a runc init subprocess, which sets up all needed restrictions on itself to prevent the called process from escaping the container
• runc init then calls execve on the requested binary, which becomes the new process inside the container
The attack consists in replacing the binary to execute with /proc/self/exe, so that runc init executes itself.

Why does it work? /proc/self/exe should be a symlink to something like /usr/sbin/runc, so execve should try to execute /usr/sbin/runc from the container, right?
procfs is a special filesystem. /proc/[pid]/exe does not follow the normal semantics of symlinks. When a process opens /proc/[pid]/exe, the kernel gives access to the open file entry directly.
However, we cannot overwrite the runC binary while the process is running. And if the runC process exits, /proc/[runc-pid]/exe will disappear and we will lose the reference to the runC binary.
So we have to:
• first identify the PID of runC
• then open /proc/[runc-pid]/exe read-only
• finally, wait until runC exits, then open /proc/self/fd/[ro-runc-fd] in write mode
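The procfs behaviour these steps rely on can be observed harmlessly on a scratch file. In the sketch below, a temporary file stands in for the binary and unlinking its path stands in for the moment the original reference disappears; the function name is made up for the example, and the code assumes Linux with /proc mounted.

```python
import os
import tempfile


def reopen_after_unlink():
    """Rewrite a file through /proc/self/fd/N after its path is gone."""
    if not os.path.isdir("/proc/self/fd"):
        return None  # procfs not available (not Linux, or /proc not mounted)

    # Scratch file that we own, used as a stand-in for the target.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"original")
    os.close(fd)

    ro_fd = os.open(path, os.O_RDONLY)  # keep a read-only descriptor
    os.unlink(path)                     # the original path no longer exists

    # Reopening through procfs yields a new, writable open file on the
    # same underlying file, even though no path leads to it any more.
    rw_fd = os.open(f"/proc/self/fd/{ro_fd}", os.O_WRONLY | os.O_TRUNC)
    os.write(rw_fd, b"replaced")
    os.close(rw_fd)

    # Read back through the read-only descriptor to see the new content.
    os.lseek(ro_fd, 0, os.SEEK_SET)
    data = os.read(ro_fd, 64)
    os.close(ro_fd)
    return data
```

The read-only descriptor sees the rewritten content because both descriptors refer to the same file; the write still has to pass the normal permission checks on that file.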
Once the runC binary has been replaced, the system is infected and the attacker can gain root access to the host.
This vulnerability has been patched by copying the runC executable into a temporary filesystem, so that any modification made to runC is discarded.
We create a container from a malicious image and give the container the CAP_SYS_MODULE capability. The image contains a kernel module that opens a reverse shell to the attacker's IP address.
The insmod command inserts a module into the kernel. Since we gave the CAP_SYS_MODULE capability to the container, and containers share the host's kernel, the syscall succeeds and modifies the host kernel.
The container has to run as root, as does the Docker engine; otherwise the syscall is denied.
The moral of this story: pay attention to the privileges and capabilities you give to a container, run containers as a non-root user or use user namespace ID remapping, and use a rootless Docker engine if you can.