To Docker and Beyond!

by Malik Algelly and Hugo Haldi

Topics

Linux namespaces

Docker networks

Focus on User Namespaces

Capabilities

A few vulnerabilities and exploits

WTF are namespaces?

A namespace wraps a global system resource in an abstraction that isolates it
Changes made to a resource within a namespace are visible to all the processes inside the namespace, but not to other processes
One common use is containerization
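
A quick way to see which namespaces a process belongs to is to read the links under /proc/[pid]/ns. The sketch below assumes a Linux system with procfs mounted; list_ns is a helper name chosen here for illustration, not a standard command.

```sh
# list_ns PID: print one "type -> type:[inode]" line per namespace of PID.
# Two processes share a namespace when the inode numbers are identical.
list_ns() {
    for ns in /proc/"$1"/ns/*; do
        # Skip the unexpanded pattern when /proc/PID/ns does not exist.
        [ -e "$ns" ] || [ -L "$ns" ] || continue
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

# Namespaces of the current shell.
list_ns $$
```

Running the same function in two terminals and comparing the numbers shows at a glance which resources the two shells share.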

Linux Namespaces

Existing namespaces

UTS
PID
Mount
Network
User

IPC, Cgroup, Time, ...

UTS

Isolates the hostname and domain name between processes
Changes made to the hostname from a process inside a UTS namespace are visible to all the processes within this namespace
However, the processes outside the UTS namespace cannot see the changes

A child created by fork inherits the UTS namespace of its parent
When a process creates a new UTS namespace, hostname and domain name are inherited from the caller

Demo

# Terminal 1: initial UTS namespace
$ hostname
ms-7917

# waiting for the new UTS
# namespace to change hostname
$ hostname
ms-7917

# change the hostname in the
# initial UTS namespace
$ hostname yggdrasil

# Terminal 2: new UTS namespace
$ unshare -u
$ hostname
ms-7917

# change the hostname in
# the new UTS namespace
$ hostname thor
$ hostname
thor

# wait for the initial UTS
# namespace to change hostname
$ hostname
thor

PID

Isolates the process ID number space
Processes in different namespaces can have the same PID
The first process in a new namespace becomes its init process and gets PID 1 in that namespace
Useful for migrating containers between systems without PID collisions.

init becomes the parent of orphaned processes inside the namespace
When init dies, the kernel terminates all the other processes in the namespace with SIGKILL.
PID namespaces form a hierarchy
Processes in a parent PID namespace can see all the processes in its child PID namespaces. The opposite is not true.
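
The hierarchy is visible in the NSpid field of /proc/[pid]/status, which lists the PID of a process in each nested PID namespace, outermost first. The sketch below assumes Linux 4.1 or later (where the field exists); nspid is a hypothetical helper name.

```sh
# nspid FILE: print the NSpid field of a /proc/[pid]/status file, i.e. the
# PID of the process in each nested PID namespace, outermost first.
nspid() {
    awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# Show the value for the current shell when procfs is available.
if [ -r "/proc/$$/status" ]; then
    nspid "/proc/$$/status"
fi
```

For a shell that is PID 1 in a child namespace, the line read from the parent namespace holds two numbers: the PID seen by the parent, then 1.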

PID namespace hierarchy

Diagram from a conference talk by Michael Kerrisk

Demo

$ sudo unshare -p -f
$ echo $$
1
# ps reads /proc, which still belongs to the parent PID namespace
$ ps
    PID TTY          TIME CMD
  39352 pts/3    00:00:00 sudo
  39353 pts/3    00:00:00 unshare
  39354 pts/3    00:00:02 bash
  49657 pts/3    00:00:00 ps

Mount

Isolation of mount points per process
A new mount namespace starts with a copy of the caller's mount points

Demo (contd)

$ sudo unshare -p -f -m
$ mount -t proc none /proc
$ echo $$
1
$ ps
    PID TTY          TIME CMD
      1 pts/3    00:00:00 bash
     32 pts/3    00:00:00 ps

Network

Per-process isolation of network devices
A network device belongs to exactly one network namespace at a time
We can use a virtual Ethernet (veth) device pair to create a pipe-like tunnel between network namespaces, attaching each end of the pair to a different namespace
When a namespace is freed, its physical devices are moved back to the initial network namespace while its virtual devices are destroyed

Demo

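# create a network namespace named loki
ip netns add loki
# create a veth pair: eth0-l will go to loki, veth-l stays in the initial namespace
ip link add eth0-l type veth peer name veth-l
ip link set eth0-l netns loki
# configure the end that stays in the initial namespace
ip link set veth-l up
ip address add 10.0.0.1/24 dev veth-l
# configure the loopback and the other end inside loki
ip netns exec loki ip link set lo up
ip netns exec loki ip link set eth0-l up
ip netns exec loki ip address add 10.0.0.2/24 dev eth0-l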

References:

Talk by Jérôme Petazzoni at DockerCon 2015 Barcelona

Docker Networks

Network types

bridge

host

overlay

ipvlan

macvlan

none

third-party drivers

Bridge network

Default network driver (when you don’t specify a network)
A software bridge that lets the containers connected to it communicate with each other
Only the containers connected to the same bridge can see each other

Network bridge

Host network

Does not create a new network namespace
Runs the container inside the host's network namespace

IPvlan L2

From the Docker documentation

MACvlan

Similar to IPvlan, but assigns a MAC address to each container, making it appear as a physical device on the network.

Useful for applications that analyse network traffic.

And more

None, Overlay, third-party drivers, …

User namespaces

Isolates security-related identifiers and attributes ({U,G}ID, root directory, capabilities, …)
UID and GID can be different inside and outside a user NS
A process can have an unprivileged UID outside a user NS while having UID 0 inside it.
The process that creates a new user NS gains a full set of capabilities inside that user NS.
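
The inside/outside translation is driven by /proc/[pid]/uid_map, whose lines have the form "ID-inside ID-outside length". A minimal sketch of the lookup, assuming that format; map_uid is a hypothetical helper, and the UIDs 1000 and 100000 used in the note below are example values only.

```sh
# map_uid ID MAPFILE: translate an ID seen inside a user namespace into the
# ID seen in the parent namespace, using "inside outside length" lines.
map_uid() {
    awk -v id="$1" '
        id + 0 >= $1 && id + 0 < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    ' "$2"
}

# Translate UID 0 with the map of the current process, when available.
if [ -r /proc/self/uid_map ]; then
    map_uid 0 /proc/self/uid_map
fi
```

With a map containing the single line "0 1000 1", UID 0 inside the namespace is UID 1000 outside, and every other UID has no mapping (the kernel then reports the overflow UID, usually 65534).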

Hierarchy

User NS form a hierarchy: each one has a parent, except for the initial user NS.
A user NS can have multiple child user NS
A maximum of 32 nested levels of user NS.

A process is a member of exactly one user NS
A user NS can own other NS (network, PID, …)
Capabilities only apply to resources governed by the NS that the process's user NS owns
E.g. a process having CAP_NET_ADMIN can only modify network interfaces that are in a network NS owned by the process's user NS.

User namespace example

Diagram from a conference talk by Michael Kerrisk

CAPABILITIES

From a binary view of privileges (root or not root) to a set of 41 capabilities

Goal?

If a program that has one or more capabilities is compromised, it has less opportunity to do damage than a root process

The way they should be seen

Rather than a way of giving privileges to unprivileged processes,

capabilities let you remove privileges from the all-powerful root

Thread capabilities sets

Each thread has 5 capability sets that define the capabilities it holds, those it can acquire and those that can be passed on across an execve.
The sets: Effective, Permitted, Inheritable, Bounding, Ambient

Thread capabilities sets

Effective: the set the kernel uses for permission checks

Permitted: the capabilities the thread may add to its effective set with the capset system call

Inheritable: the capabilities preserved across an execve

Bounding: limits the capabilities that can be gained during an execve

Ambient: the capabilities preserved across an execve of a program that is not privileged.
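
The five sets of a running process can be read as hexadecimal bit masks from /proc/[pid]/status. A small sketch, assuming a kernel recent enough to expose the CapAmb line (Linux 4.3 or later); cap_sets is a hypothetical helper name.

```sh
# cap_sets FILE: print the five capability sets recorded in a
# /proc/[pid]/status file as hexadecimal bit masks.
cap_sets() {
    awk '
        /^CapInh:/ { print "Inheritable " $2 }
        /^CapPrm:/ { print "Permitted   " $2 }
        /^CapEff:/ { print "Effective   " $2 }
        /^CapBnd:/ { print "Bounding    " $2 }
        /^CapAmb:/ { print "Ambient     " $2 }
    ' "$1"
}

# Sets of the current shell, when procfs is available.
if [ -r "/proc/$$/status" ]; then
    cap_sets "/proc/$$/status"
fi
```

Each bit position in a mask corresponds to one capability number, so an all-zero mask means the set is empty.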

File capabilities sets

The kernel supports associating capability sets with an executable file, similar to setuid.

Each file has 3 sets of capabilities

Effective (in practice a single flag), Permitted, Inheritable

File capabilities sets

File capability sets are stored in an extended attribute named security.capability.

Writing to this attribute needs the CAP_SETFCAP capability.

File capability sets, in combination with the thread capability sets, will determine the capabilities of a thread after an execve.

Capability set theory
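
The rules the kernel applies during an execve, as given in the capabilities(7) manual page, can be evaluated with plain bit arithmetic. The sketch below is a simplification: it treats a file as privileged only when it carries file capabilities (setuid and setgid bits are ignored), it does not model the special handling of UID 0, and execve_caps is a hypothetical helper name.

```sh
# execve_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
# P_* are the thread sets before execve, F_* the file sets (hex, no 0x prefix).
# F_EFF is the file effective flag: 0 or 1.
# The inheritable and bounding sets of the thread are left unchanged by execve.
execve_caps() {
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6

    # Ambient set: cleared when the file is privileged.
    if [ $((f_inh | f_prm)) -ne 0 ]; then new_amb=0; else new_amb=$p_amb; fi

    # Permitted set: (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(amb)
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))

    # Effective set: P'(prm) when the file flag is set, else P'(amb).
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi

    printf 'permitted=%x effective=%x ambient=%x\n' "$new_prm" "$new_eff" "$new_amb"
}

# A file granting CAP_NET_RAW (bit 13, mask 2000), run by a thread with no
# inheritable or ambient capabilities and a bounding set that contains bit 13.
execve_caps 0 a80425fb 0 0 2000 1
```

The call at the end prints permitted=2000 effective=2000 ambient=0: the thread gains CAP_NET_RAW only because the bounding set allows it.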

File capabilities in Docker

Docker imposes some limitations with file capabilities.

Extended attributes are removed when Docker images are built.

“This means you will not normally have to concern yourself too much with file capabilities in containers.”
(dockerlabs)

Exploits

CVE-2019-5736: A runC vulnerability

Principle

Exploits procfs and runC.

/proc/self/exe is a symlink to the executable of the current process
/proc/self/fd is a directory containing the file descriptors opened by the process
runC is the default container runtime used by Docker

Vulnerability

Allows a malicious container to overwrite the host runc binary
Can be exploited in either of these two situations:
- Creating a new container from a malicious image
- Attaching (docker exec) to an existing container to which the attacker previously had write access

When creating or attaching to a container, runC spawns a runc init subprocess that applies all the needed restrictions to itself, so that the requested process cannot escape the container
runc init then calls execve on the requested binary, which replaces it as the process running inside the container

runC init

The attack consists in replacing the binary to execute with /proc/self/exe, so that runc init executes itself

runC init trick

Why does it work? /proc/self/exe should be a symlink to something like /usr/sbin/runc, so execve should try to execute /usr/sbin/runc from the container, right?

procfs is a special filesystem. /proc/[pid]/exe does not follow the normal semantics for symlinks. When a process opens /proc/[pid]/exe, the kernel gives access to the open file entry directly.

However, we cannot overwrite the runC binary while it is being executed. But if the runC process exits, /proc/[runc-pid]/exe will disappear and we will lose the reference to the runC binary.

So we have to:

• first, identify the PID of runC

• then, open /proc/[runc-pid]/exe read-only

• finally, wait until runC exits, then open /proc/self/fd/[ro-runc-fd] for writing
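
The last two steps rely on the fact that a path under /proc/[pid]/fd reopens the file behind a descriptor, with a fresh access mode. This can be observed on a harmless scratch file, as in the sketch below; it assumes Linux with procfs mounted, and the temporary file merely stands in for a binary (there is no running executable, so the wait in the last step is not needed here).

```sh
# Harmless stand-in for the target file: a scratch file we own.
target=$(mktemp)
echo original > "$target"

# Keep a read-only descriptor (number 3) on the file.
exec 3< "$target"

# Reopen the same file for writing through procfs. The kernel resolves
# this path to the file behind descriptor 3 instead of following a
# textual symlink, and applies the usual permission check on the file.
echo replaced > "/proc/$$/fd/3"

# Close the descriptor and look at the file again.
exec 3<&-
cat "$target"
```

The descriptor was opened read-only, yet the content of the file has changed, because the second open is an independent one that only requires write permission on the file itself.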

Once the runC binary has been replaced, the host is compromised and the attacker can gain root access to it.

This vulnerability was patched by making runC re-execute itself from a temporary copy of its binary, so that any modification made through /proc/[pid]/exe only affects the copy and is discarded.

Exploit capabilities: CAP_SYS_MODULE

Principle

Run a malicious image with the CAP_SYS_MODULE capability
Load a kernel module from inside the running container
Open a reverse shell to a listening socket on the attacker's system

We create a container from a malicious image and give the CAP_SYS_MODULE capability to the container. The image contains a kernel module that will create a reverse shell to the attacker's IP address.

The insmod command inserts a module into the kernel. Since containers share the host kernel and we gave the CAP_SYS_MODULE capability to the container, the syscall succeeds and modifies the host kernel.
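
Whether a container holds CAP_SYS_MODULE can be checked from its CapEff mask in /proc/[pid]/status, since each capability is one bit (CAP_SYS_MODULE is number 16 in linux/capability.h). A minimal sketch; has_cap is a hypothetical helper, and 00000000a80425fb is the mask usually reported inside a container started with Docker's default capability set.

```sh
# has_cap MASK BIT: report whether capability number BIT is set in the
# hexadecimal MASK (as printed on the CapEff line of /proc/[pid]/status).
has_cap() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

CAP_SYS_MODULE=16   # capability number from <linux/capability.h>

has_cap 00000000a80425fb "$CAP_SYS_MODULE"   # default set: prints no
has_cap 00000000a80525fb "$CAP_SYS_MODULE"   # same set plus bit 16: prints yes
```

A "yes" for a container that does not need to load kernel modules is a sign that it was started with more privileges than necessary.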

Demo

The container has to run as root, and so does the Docker engine; otherwise the syscall is denied.

The moral of this story is that you have to pay attention to the privileges and capabilities you give to a container: try to run containers as a non-root user or use user namespace ID remapping, and use the rootless Docker engine if you can.

Questions

Describe (in detail) the state of the capability sets of a child process after a fork
How would you counter the patch for CVE-2019-5736?

(joke)

Questions (more seriously)

Give one example of good security practice with containers.
What is the objective of using containers?
