Network namespaces and containers

OpenStack can be deployed in several ways. For testing and development, for example, DevStack is used. For production environments, Kolla-Ansible provides a set of tools to define, create and deploy a set of containers running the needed services. TripleO is another deployment tool that can be configured to run the services in containers.

Neutron, the networking orchestrator project, uses network namespaces to isolate some operations. For example, the DHCP agent creates one namespace per network; this namespace is connected to another namespace via a veth pair. Inside this DHCP namespace, a dnsmasq process serves the DHCP requests from any registered device (the dnsmasq daemon has a list of authorized MAC addresses).

Example of DHCP namespace:

$ ip netns exec qdhcp-2365b6b7-8532-4db2-ab7a-e5a6f9000db1 ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
28: tap0c79968e-a5: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:e8:c7:1d brd ff:ff:ff:ff:ff:ff
    inet 10.0.10.2/26 brd 10.0.10.63 scope global tap0c79968e-a5
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee8:c71d/64 scope link 
       valid_lft forever preferred_lft forever

What happens when the DHCP agent is running inside a container? Well, in this case the container will create and modify, in the host, the network namespace to hold the dnsmasq process. Here is where the game begins.

How a network namespace is mounted in the filesystem.

Just as a reference, this is what iproute2 and pyroute2 do to add a new namespace. The pyroute2 implementation is a translation of the iproute2 code.

First, the common directory is created: /run/netns. Then this mounting point, common for all namespaces in the system, is marked with the flags MS_SHARED and MS_REC. This is very important because it makes it possible for network namespace mounts to propagate between mount namespaces. If the directory is not a mounting point yet, it is first bind-mounted on top of itself recursively (MS_BIND and MS_REC) so that it can be marked as shared. At this point, the resource can be found in the filesystem as a shared mounting point:

$ findmnt -oTARGET,SOURCE,FSTYPE,PROPAGATION
TARGET                                SOURCE                 FSTYPE     PROPAGATION
...
│ └─/run/netns                        tmpfs[/netns]          tmpfs      shared

Once the common mounting point is created (this will be done only once in the system), the namespace mounting point is created; in our case /run/netns/ns01. Then the network namespace itself is created, by creating new IPv4 and IPv6 stacks, using unshare.

The last step is to bind-mount /proc/self/ns/net onto the namespace mounting point, /run/netns/ns01. The directory /proc/[pid]/ns contains one entry for each namespace that supports being manipulated by setns.
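
The whole sequence can be sketched in a few lines of Python. This is only a minimal sketch of the steps described above, not the actual iproute2 or pyroute2 code: the names netns_add, sys_mount and sys_unshare are mine, the system calls are reached through ctypes, and running it for real requires root. The mount and unshare callables can be swapped so the order of the operations can be checked without touching the system.

```python
import ctypes
import errno
import os

# Flag values from <linux/mount.h> and <linux/sched.h>.
MS_BIND = 4096
MS_REC = 16384
MS_SHARED = 1 << 20
CLONE_NEWNET = 0x40000000

_libc = ctypes.CDLL(None, use_errno=True)


def sys_mount(source, target, flags):
    """Thin wrapper over mount(2); raises OSError on failure."""
    ret = _libc.mount(source.encode(), target.encode(), b"none",
                      ctypes.c_ulong(flags), None)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


def sys_unshare(flags):
    """Thin wrapper over unshare(2); raises OSError on failure."""
    if _libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def netns_add(name, run_dir="/run/netns", mount=sys_mount, unshare=sys_unshare):
    """Create a persistent network namespace called `name` (illustrative)."""
    # 1. Common directory, created only once in the system.
    os.makedirs(run_dir, mode=0o755, exist_ok=True)
    # 2. Mark it as shared so the mounts propagate between mount namespaces.
    try:
        mount("", run_dir, MS_SHARED | MS_REC)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        # Not a mounting point yet: bind it on top of itself, then retry.
        mount(run_dir, run_dir, MS_BIND | MS_REC)
        mount("", run_dir, MS_SHARED | MS_REC)
    # 3. Empty file that will be the namespace mounting point.
    path = os.path.join(run_dir, name)
    os.close(os.open(path, os.O_RDONLY | os.O_CREAT | os.O_EXCL, 0))
    # 4. New network stack for the calling process.
    unshare(CLONE_NEWNET)
    # 5. Pin the new namespace to the file with a bind mount.
    mount("/proc/self/ns/net", path, MS_BIND)
    return path
```

Note that, in this sketch, the calling process itself ends up inside the new network namespace; the ip command does not care because it exits right after creating it.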

A network namespace created with the ip command lives, at most, until the system is restarted: network namespaces do not persist across reboots. When a network namespace is created with the unshare or clone system calls, it is bound to the life of the current process, so if the process exits, the network namespace is removed. Only a privileged process can make its network namespace persistent when there are no processes in it.

When a namespace is deleted using the ip command, it is actually removed only when the last process running inside it stops.
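
Deleting is the mirror operation: the bind mount that pins the namespace is detached and the file is unlinked. Here is a minimal sketch under the same assumptions as before (my own function names, ctypes for the system call, root needed for real use); it is not the ip command's actual code. Because the unmount is lazy (MNT_DETACH), the kernel only frees the namespace once nothing references it any more.

```python
import ctypes
import os

MNT_DETACH = 2  # from <sys/mount.h>: lazy unmount

_libc = ctypes.CDLL(None, use_errno=True)


def sys_umount2(target, flags):
    """Thin wrapper over umount2(2); raises OSError on failure."""
    if _libc.umount2(target.encode(), flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


def netns_delete(name, run_dir="/run/netns", umount2=sys_umount2):
    """Unpin the network namespace `name` and remove its file (illustrative)."""
    path = os.path.join(run_dir, name)
    # Detach the bind mount; processes still inside keep the namespace alive.
    umount2(path, MNT_DETACH)
    # Remove the now plain file from the common directory.
    os.unlink(path)
    return path
```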

Let’s create some network namespaces.

In a freshly rebooted system, no network namespaces exist yet. The common mounting point for the network namespaces, /run/netns, has not been created yet.

TARGET                                SOURCE                 FSTYPE     PROPAGATION
...
├─/run                                tmpfs                  tmpfs      shared

Once we create the first namespace, two new mounting points appear in the filesystem: the netns directory and the namespace itself.

TARGET                                SOURCE                 FSTYPE     PROPAGATION
...
├─/run                                tmpfs                  tmpfs      shared
│ ├─/run/netns                        tmpfs[/netns]          tmpfs      shared
│ │ └─/run/netns/ns01                 nsfs[net:[4026532276]] nsfs       shared
│ └─/run/netns/ns01                   nsfs[net:[4026532276]] nsfs       shared
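
The PROPAGATION column printed by findmnt can also be read straight from /proc/self/mountinfo, which is handy inside a minimal container where findmnt may not be installed. The following is a small sketch, assuming the mountinfo layout documented in proc(5), where the optional fields before the "-" separator carry "shared:N" or "master:N" tags; the function names are mine.

```python
def propagation_from_mountinfo(line):
    """Return (mount_point, propagation) for one /proc/<pid>/mountinfo line."""
    fields = line.split()
    separator = fields.index("-")   # end of the optional fields
    mount_point = fields[4]
    tags = []
    for field in fields[6:separator]:
        if field.startswith("shared:"):
            tags.append("shared")
        elif field.startswith("master:"):
            tags.append("slave")
        elif field == "unbindable":
            tags.append("unbindable")
    # No propagation tag at all means the mount is private.
    return mount_point, ",".join(tags) or "private"


def mount_propagation(target, mountinfo="/proc/self/mountinfo"):
    """Propagation of the topmost mount on `target`, or None if not mounted."""
    result = None
    with open(mountinfo) as handle:
        for line in handle:
            mount_point, propagation = propagation_from_mountinfo(line)
            if mount_point == target:
                result = propagation   # later entries are stacked on top
    return result
```

For example, mount_propagation("/run/netns") should return "shared" on a host where a namespace has already been created with the ip command, and None before the first one is created.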

If the namespace is deleted, the netns directory is left in place to be used by other namespaces and it’s kept as “shared”. What can happen if the propagation status is not shared? Let’s play a game: let’s create a private copy of the mount namespace, not shared with any other process, using unshare:

$ unshare --mount
$ findmnt -oTARGET,SOURCE,FSTYPE,PROPAGATION
TARGET                                SOURCE                 FSTYPE     PROPAGATION
...
├─/run                                tmpfs                  tmpfs      private

If, from this unshared mount namespace, we create a new network namespace, we’ll have a weird situation: each mount namespace will have its own network namespace; the file will be present in the filesystem of both, but the mounting point won’t be shared between them.

TARGET                               SOURCE                 FSTYPE     PROPAGATION
# ISOLATED MOUNT NAMESPACE
...
├─/run                               tmpfs                  tmpfs      private
│ └─/run/netns                       tmpfs[/netns]          tmpfs      shared
│   └─/run/netns/ns02                nsfs[net:[4026532335]] nsfs       shared

# SYSTEM MOUNT NAMESPACE
...
├─/run                               tmpfs                  tmpfs      shared
│ ├─/run/netns                       tmpfs[/netns]          tmpfs      shared
│ │ └─/run/netns/ns01                nsfs[net:[4026532277]] nsfs       shared
│ └─/run/netns/ns01                  nsfs[net:[4026532277]] nsfs       shared

That will lead to a situation where the second network namespace is inaccessible:

$ ll /run/netns/
total 0
drwxr-xr-x  2 root root ?   80 Mar  8 20:31 ./
drwxr-xr-x 36 root root ? 1200 Mar  8 20:21 ../
-r--r--r--  1 root root ?    0 Mar  8 20:30 ns01
----------  1 root root ?    0 Mar  8 20:31 ns02
root@osdev18:~# ip netns 
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
ns02
ns01
root@osdev18:~# ip netns exec ns01 ip a
1: lo:  mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@osdev18:~# ip netns exec ns02 ip a
setting the network namespace "ns02" failed: Invalid argument
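
A quick way to spot these dead entries is to check whether each file under /run/netns is really backed by nsfs or is just an empty placeholder. The sketch below assumes that every namespace file lives on the same nsfs device as /proc/self/ns/net (my understanding is that "ip netns identify" relies on the same kind of device and inode comparison); the function names are mine.

```python
import os


def is_netns_mount(path, reference="/proc/self/ns/net"):
    """True if `path` sits on the same device as a known namespace file."""
    return os.stat(path).st_dev == os.stat(reference).st_dev


def stale_netns_entries(run_dir="/run/netns", reference="/proc/self/ns/net"):
    """Names under `run_dir` that are not namespace mounts in this mount namespace."""
    return sorted(
        name for name in os.listdir(run_dir)
        if not is_netns_mount(os.path.join(run_dir, name), reference)
    )
```

In the situation above, I would expect stale_netns_entries() to return ["ns02"] when called from the system mount namespace.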

And now from inside a container.

The goal is to be able to create a network namespace in the host. By default, all files created inside a container are stored on a writable container layer. Basically, this means that there is no way to modify the host filesystem from the container filesystem; that makes sense if one of the goals of a container is to isolate the running processes from the rest of the system.

Without any further preparation, let’s create a container and then create a network namespace.

$ podman pull docker.io/library/fedora
$ podman run -d --name my_container -it fedora
$ podman exec -it my_container bash

# Now from inside the container
$ yum update -y; yum install iproute -y
$ ip netns add ns01
mount --make-shared /var/run/netns failed: Operation not permitted

OK, something is not working: we first need to share the network namespace mounting point with the container. This is done using volumes. Due to the nature of this operation (it needs root permissions), we also need to give extended privileges to the container.

$ podman run -d --name my_container -v /var/run/netns:/var/run/netns:shared --privileged -it fedora

Now it’s possible to install iproute and then successfully create a network namespace that will be accessible from the container and the host (or any other container with the correct privileges).

A little bug, big problems!

I’ve used podman in this example to spawn the container. I recommend reading a bit about container networking with podman (I’m still doing it!). When the package is installed, a default network configuration is installed into /etc/cni/net.d. This network configuration defines how rootful and rootless containers can communicate with each other and with the host. The default network configuration creates a network namespace to isolate the pod networking. That means that, when using podman, the host directory /run/netns will always be created.

But that was not happening with Docker. When the container was started, the directory /run/netns had not been created yet in the host. This is not a problem… unless you have an unpatched version of iproute2 or pyroute2 (big drama!). From the patch commit message (and the pyroute2 equivalent):

“When ip netns {add|delete} is first run, it bind-mounts /var/run/netns on top of itself, then marks it as shared. However, if there are already bind-mounts in the directory from other tools, these would not be propagated. Fix this by recursively bind-mounting.”

When a network namespace is created from inside a container, the namespace mounting points are created without the correct propagation flag. From the host, this is what the filesystem looks like:

TARGET                                SOURCE                 FSTYPE     PROPAGATION
...
│ ├─/run/netns                        tmpfs[/netns]          tmpfs      shared 
│ │ └─/run/netns/ns02                 nsfs[net:[4026532443]]
│ ├─/run/netns/ns02                   nsfs[net:[4026532443]]

The host, therefore, does not have access to the namespace created by this container:

$ ll /run/netns/
total 0
drwxr-xr-x  2 root root ?   80 Mar  9 20:31 ./
drwxr-xr-x 36 root root ? 1200 Mar  9 20:21 ../
----------  1 root root ?    0 Mar  9 20:31 ns02
root@osdev18:~# ip netns
Error: Peer netns reference is invalid.
ns02

This problem was solved in iproute2 v4.13.0 and pyroute2 v0.5.7.
