Containers: How they are contained


Containerization is intriguing because it makes it so simple to run anything in a controlled, isolated environment without using a virtual machine. A lot of magic goes on behind the scenes to "contain" the containers so that they use only limited resources.

Here we'll see how we can use cgroups and namespaces to achieve that. This will help us understand how container orchestration platforms like Kubernetes manage node resources, and how we can use these features beyond setting limits on nodes.

What are cgroups?

Control Groups, aka cgroups, are a Linux kernel feature that tracks and enforces resource allocation (CPU, processes, I/O, memory, network bandwidth). 'How much can a container use?' is controlled using cgroups.

Processes on Linux always form a tree, as each of them is a child of some other process. This means a set of processes can be grouped together, and that is how cgroups enforce limits. For example, if a process is allowed to use only 500MB of memory, then the combined memory usage of the process and all its children cannot exceed 500MB.
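
You can see this tree on any Linux box - a quick illustration (the processes shown will of course differ on your system):

ps -eo pid,ppid,comm --forest | head -n 15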

Warning: All shell commands need to be executed as the root user. Be very careful here!

These control groups reside under /sys/fs/cgroup/, mounted as cgroupfs - the cgroup filesystem. There you'll find sub-trees, each grouping a collection of Linux processes. Take a look at mount:

root@node0:~# mount |grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
root@node0:~#

That tells you that you are using cgroups v2, mounted on /sys/fs/cgroup/. In any of the sub-directories under /sys/fs/cgroup/, you'll find a set of files for each controller. Cgroup controllers are collections of parameters that control how, and how much of, a resource is allocated to a group of processes. Examples of controllers are io, memory, cpu, cpuset, pids etc.
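
To see which controllers are enabled at the root of the hierarchy on your system (the list varies per kernel and configuration), read cgroup.controllers:

# prints a space-separated list, e.g. cpuset cpu io memory pids
cat /sys/fs/cgroup/cgroup.controllers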

For example, if you run a docker container, you can list memory controller files by doing this:

root@node0: # ls /sys/fs/cgroup/system.slice/docker-<container-hash>/memory.*
memory.current       memory.low        memory.oom.group  memory.stat          memory.swap.max
memory.events        memory.max        memory.peak       memory.swap.current  memory.zswap.current
memory.events.local  memory.min        memory.pressure   memory.swap.events   memory.zswap.max
memory.high          memory.numa_stat  memory.reclaim    memory.swap.high
root@node0: #

You'll find similar controller files for other controllers like cpu, pids etc.

Example demonstrating Out of Memory (OOM)

We'll need memhog to demonstrate OOM - it comes with the numactl deb package.

sudo apt install numactl

memhog just hogs the amount of memory we tell it to. This will allocate 100MB of memory.

memhog 100M

The following one-liner bash script should be enough for our demonstration:

#!/bin/bash

# try to allocate 100MB, ten times in a row
for i in $(seq 10); do memhog 100M; done

Play with cgroups

We'll need another package called cgroup-tools to create and manage our cgroup.

sudo apt install cgroup-tools

Create a cgroup

sudo cgcreate -g memory:memhog-demo

This creates a cgroup called memhog-demo with memory as its controller. It also creates a directory /sys/fs/cgroup/memhog-demo/ containing files through which you can set values for the respective controllers. The location of this directory may vary based on your system.
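
Under the hood, cgcreate on cgroups v2 is essentially just creating a directory in the cgroup filesystem; if you don't have cgroup-tools handy, this sketch (assuming the v2 hierarchy is mounted at /sys/fs/cgroup) does the same thing:

# creating a directory creates a child cgroup; the controller must also be
# enabled in the parent's cgroup.subtree_control (it usually is for memory)
mkdir /sys/fs/cgroup/memhog-demo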

All of these files have default values, for example:

root@node0: # cat /sys/fs/cgroup/memhog-demo/memory.max 
max
root@node0: #

This means all processes belonging to memhog-demo can use as much memory as is available - there is no limit yet.

Set a limit for memory

root@node0: # cgset -r memory.max=50M memhog-demo
root@node0: # cat /sys/fs/cgroup/memhog-demo/memory.max 
52428800
root@node0: #

This sets a limit of 50MB (52428800 bytes) on the maximum memory allowed for this group. If swap is enabled, we also need to cap swap usage, otherwise the processes can simply spill over into swap instead of being killed.

root@node0: # cgset -r memory.swap.max=0 memhog-demo
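
cgset is essentially a thin wrapper that writes into these files; the same limits can be set directly (same paths as above):

echo 52428800 > /sys/fs/cgroup/memhog-demo/memory.max
echo 0 > /sys/fs/cgroup/memhog-demo/memory.swap.max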

Now test our memory hogger

We have a cgroup with a memory limit of 50MB and a bash script that repeatedly tries to allocate 100MB. To test this, we need to run the script inside the newly created memhog-demo cgroup - it is important that the process actually runs under that group. Here it is:

sudo cgexec -g memory:memhog-demo ./memhog-demo.sh
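
(cgexec simply places the process into the group before exec'ing it; the manual equivalent on cgroups v2 is roughly to move your shell into the group by writing its PID to cgroup.procs and then run the script:)

echo $$ > /sys/fs/cgroup/memhog-demo/cgroup.procs
./memhog-demo.sh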

And here is the output, where we see the process getting killed each time it tries to allocate 100MB while memory.max is set to 50MB.

root@node0: # cgexec -g memory:memhog-demo /home/kanak/bin/memhog-demo.sh 
...../home/kanak/bin/memhog-demo.sh: line 4: 290436 Killed                  memhog 100M
...../home/kanak/bin/memhog-demo.sh: line 4: 290446 Killed                  memhog 100M
...../home/kanak/bin/memhog-demo.sh: line 4: 290457 Killed                  memhog 100M

That is exactly what we expected: the process gets killed if it tries to violate the restrictions imposed by the cgroup.
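
The kills are also visible in the cgroup's own accounting: memory.events keeps per-group counters, and its oom_kill field should reflect how many iterations were actually killed on your run.

# "max" counts how often memory.max was hit, "oom_kill" counts killed processes
cat /sys/fs/cgroup/memhog-demo/memory.events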

Limit number of processes

One of the cgroup controllers is called pids; you'll find the following files under the group's subtree:

root@node0: # /bin/ls /sys/fs/cgroup/memhog-demo/pids.*
/sys/fs/cgroup/memhog-demo/pids.current  /sys/fs/cgroup/memhog-demo/pids.max
/sys/fs/cgroup/memhog-demo/pids.events     /sys/fs/cgroup/memhog-demo/pids.peak
root@node0: #

We'll need to set pids.max to control the number of processes that can be forked.

root@node0: # cgset -r pids.max=5 memhog-demo
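
As with the memory limit, you can verify that the value took effect:

cat /sys/fs/cgroup/memhog-demo/pids.max   # should now print 5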

But that in itself is not enough for a clean demonstration. On Linux, processes are tracked through the /proc pseudo filesystem, and by default there is a single proc filesystem mounted for the host OS. We therefore need a way to mount an isolated proc for our cgroup. This is going to be beautiful...

Namespaces

Namespaces are another Linux kernel feature vital to containerization. They partition kernel resources in such a way that a group of processes can use one set of resources while being completely blind to resources belonging to the host or to other groups. In other words, namespaces allow you to create completely isolated environments within a single OS.

There are several types of namespaces - IPC, Network, PID, Mount, UTS, User. We'll create a PID namespace, mount a fresh proc filesystem inside it, and run a shell in the cgroup we created earlier.

root@node0: # cgexec -g memory:memhog-demo unshare --pid --fork --mount-proc

Recall that cgexec, part of cgroup-tools, executes a process in our cgroup memhog-demo. Here it executes the unshare --pid --fork --mount-proc command. unshare is a tool that creates namespaces and controls how they are created (with no program given, it runs your default shell). In the command above we instruct unshare to:

  • Unshare the PID namespace using the --pid option.

  • Fork as a child process using the --fork option.

  • Mount a fresh /proc filesystem with the --mount-proc option.

This will land us in a new shell with an isolated PID namespace, which means it will see only itself and its child processes. Example:

root@node0: # cgexec -g memory:memhog-demo unshare --pid --fork --mount-proc
root@node0:~# 
root@node0:~# ps
    PID TTY          TIME CMD
      1 pts/1    00:00:00 bash
     13 pts/1    00:00:00 ps
root@node0:~#

As expected, we landed in a shell that thinks it is all alone 😏. If we look at the /proc filesystem, we find it is exclusive to this namespace. If you want to check which cgroup the shell belongs to, look at this:

root@node0:~# cat /proc/1/cgroup 
0::/memhog-demo
root@node0:~#

Here 1 is the process ID of the shell we are running, as seen from inside the namespace.
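
If you want further proof that this shell lives in its own PID namespace, compare the namespace it is attached to with the one the host's init uses (run the second command from another terminal on the host; the two inode numbers will differ):

# inside the unshared shell
readlink /proc/$$/ns/pid

# on the host, from another terminal
readlink /proc/1/ns/pid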

Limiting number of forks

Recall that we set pids.max in our demo cgroup to a value of 5, so the cgroup cannot hold more than five processes at a time; forks beyond that will fail in this namespace.

This is what we get if we try to create more than 5 processes:

root@node0:~# ps
    PID TTY          TIME CMD
      1 pts/1    00:00:00 bash
     25 pts/1    00:00:00 ps
root@node0:~# sleep 100 &
[1] 26
root@node0:~# sleep 100 &
[2] 27
root@node0:~# sleep 100 &
[3] 28
root@node0:~# sleep 100 &
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable

That's it - the cgroup restricted forking exactly as expected.
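
From the host you can see the same thing in the pids controller's accounting: on recent kernels pids.events carries a max counter that is bumped every time a fork is denied because pids.max was reached.

cat /sys/fs/cgroup/memhog-demo/pids.events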

Clean up

Don't worry about deleting the namespace we created; it goes away when you exit the unshared shell.

But we need to delete the cgroup:

root@node0: # cgdelete memory:memhog-demo

Conclusion

It was fascinating to discover how containers work inside a single OS - all dancing together and completely oblivious of each other's existence 😄. There are numerous other cgroup and namespace features we have not covered, but this gives us a fair idea of how containers are contained.

It also gives a fair idea of how platforms like Kubernetes enforce resource limits on their pods, and how important these features become when designing cloud infrastructure with cost in mind.
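
As a rough illustration of the same knobs surfacing in everyday tooling (a sketch, not the full picture): container runtimes expose these limits as flags and write them into the container's cgroup for you.

# Docker translates these flags into memory.max and pids.max in the container's cgroup
docker run --rm -it --memory=50m --pids-limit=5 ubuntu bash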