Containerization is intriguing because it makes it so simple to run anything in a controlled and isolated environment without using a virtual machine. There is a lot of magic that goes on behind the scenes in “containing” the containers to use limited resources. Here we'll see how we can use cgroups and namespaces to achieve that. This will help us understand how container orchestration platforms like Kubernetes manage node resources, and how we can use these features beyond setting limits on nodes.
What are cgroups?
Control Groups, aka cgroups, are a Linux kernel feature that tracks and enforces resource allocation (CPU, processes, I/O, memory, network bandwidth). 'How much can a container use?' is controlled using cgroups.
Processes on Linux always form a tree, as each of them is a child of some other process. This means a set of processes can be grouped together, and that's how cgroups enforce limits. For example, if a process is allowed to use only 500MB of memory, then the combined memory usage of the process and all its children cannot exceed 500MB.
Warning: All shell commands need to be executed as the root user. Be very careful here!
These control groups reside under /sys/fs/cgroup/, mounted as cgroupfs - the cgroup filesystem. There you'll find sub-trees, each grouping a collection of Linux processes. Take a look at mount:
root@node0:~# mount |grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
root@node0:~#
That tells you that you are using cgroups v2, mounted on /sys/fs/cgroup/. In any of the sub-directories under /sys/fs/cgroup/, you'll find a set of files for each controller. Cgroup controllers are collections of parameters that control how, and how much of, a resource is allocated to a group of processes. Examples of controllers are io, memory, cpu, cpuset, pids etc.
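You can check which controllers are available at the root of the hierarchy by reading cgroup.controllers (the exact list depends on your kernel and configuration; this is just what a recent host typically shows):
root@node0:~# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
root@node0:~#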
For example, if you run a docker container, you can list the memory controller files like this:
root@node0: # ls /sys/fs/cgroup/system.slice/docker-<container-hash>/memory.*
memory.current memory.low memory.oom.group memory.stat memory.swap.max
memory.events memory.max memory.peak memory.swap.current memory.zswap.current
memory.events.local memory.min memory.pressure memory.swap.events memory.zswap.max
memory.high memory.numa_stat memory.reclaim memory.swap.high
root@node0: #
You'll find similar files for other controllers like cpu, pids etc.
Example demonstrating Out of Memory (OOM)
We'll need memhog to demonstrate OOM - it can be installed via the numactl deb package.
sudo apt install numactl
memhog just hogs the amount of memory we tell it to. The following will allocate 100MB of memory:
memhog 100M
The following one-liner bash script should be enough for our demonstration:
#!/bin/bash
for i in $(seq 10); do memhog 100M; done
Play with CGroup
We'll need another package called cgroup-tools to create and manage our cgroup.
sudo apt install cgroup-tools
Create a Cgroup
sudo cgcreate -g memory:memhog-demo
This creates a cgroup called memhog-demo with memory as the controller. It also creates a directory /sys/fs/cgroup/memhog-demo/ containing files where you can set values for the respective controllers. The location of this directory may vary based on your system.
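A quick peek inside the new directory shows the per-controller interface files (listing abbreviated here; the exact set depends on which controllers are enabled on your system):
root@node0: # ls /sys/fs/cgroup/memhog-demo/
cgroup.controllers  cgroup.events  cgroup.procs  cpu.max  memory.max  memory.swap.max  pids.max  ...
root@node0: #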
All of these files have default values, for example:
root@node0: # cat /sys/fs/cgroup/memhog-demo/memory.max
max
root@node0: #
This means all processes belonging to memhog-demo can use the maximum memory available.
Set a limit for memory
root@node0: # cgset -r memory.max=50M memhog-demo
root@node0: # cat /sys/fs/cgroup/memhog-demo/memory.max
52428800
root@node0: #
This sets a limit of 50MB (52428800 bytes) on the maximum memory allowed for this group. If swap is enabled, we also need to limit the maximum swap allowed, otherwise the processes could simply swap out once they hit the memory limit:
root@node0: # cgset -r memory.swap.max=0 memhog-demo
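As a quick sanity check, reading the file back should show the new value:
root@node0: # cat /sys/fs/cgroup/memhog-demo/memory.swap.max
0
root@node0: #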
Now test our memory hogger
We have a cgroup with a memory limit of 50MB and a bash script that repeatedly tries to allocate 100MB. To test this, we need to run the script in the newly created cgroup called memhog-demo. It is important that the process runs under the memhog-demo group. Here it is:
sudo cgexec -g memory:memhog-demo ./memhog-demo.sh
And here is the output, where we see the process getting killed each time it tries to allocate 100MB while memory.max is set to 50MB.
root@node0: # cgexec -g memory:memhog-demo /home/kanak/bin/memhog-demo.sh
...../home/kanak/bin/memhog-demo.sh: line 4: 290436 Killed memhog 100M
...../home/kanak/bin/memhog-demo.sh: line 4: 290446 Killed memhog 100M
...../home/kanak/bin/memhog-demo.sh: line 4: 290457 Killed memhog 100M
That is exactly what we expected: the process gets killed if it tries to violate the restrictions imposed by cgroups.
Limit number of processes
One of the cgroup controllers is called pids; you'll find the following files under the group's subtree:
root@node0: # /bin/ls /sys/fs/cgroup/memhog-demo/pids.*
/sys/fs/cgroup/memhog-demo/pids.current /sys/fs/cgroup/memhog-demo/pids.max
/sys/fs/cgroup/memhog-demo/pids.events /sys/fs/cgroup/memhog-demo/pids.peak
root@node0: #
We'll need to set pids.max to control the number of processes that can be forked.
root@node0: # cgset -r pids.max=5 memhog-demo
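Reading the file back confirms the limit:
root@node0: # cat /sys/fs/cgroup/memhog-demo/pids.max
5
root@node0: #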
But that alone doesn't give us a clean way to demonstrate the limit. On Linux, processes are tracked through the /proc pseudo filesystem, and by default there is a single proc filesystem mounted for the host OS. We therefore need a way to mount an isolated proc for our cgroup. This is going to be beautiful...
Namespaces
Namespaces are another Linux kernel feature vital to containerization. They partition kernel resources in such a way that a group of processes can use one set of resources while being completely blind to resources belonging to the host or to other groups. In other words, they allow you to create completely isolated environments within a single OS.
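If you want to see the namespaces that already exist on your host, lsns from util-linux lists them; for example, the PID namespaces (the namespace inode, process count and command will of course differ on your machine):
root@node0: # lsns --type pid
        NS TYPE NPROCS PID USER COMMAND
4026531836 pid     198   1 root /sbin/init
root@node0: #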
There are several types of namespaces - IPC, Network, PID, Mount, UTS, User. We'll create a PID namespace with its own proc filesystem mounted, and run a shell inside it, placed in the cgroup we created earlier.
root@node0: # cgexec -g memory:memhog-demo unshare --pid --fork --mount-proc
Recall that cgexec, part of cgroup-tools, executes a process in our cgroup memhog-demo. Here it executes the unshare --pid --fork --mount-proc command, where unshare is a tool that creates namespaces and controls how they are created. In the command above we instruct unshare to:
- Unshare the PID namespace using the --pid option.
- Fork as a child process using the --fork option.
- Mount the /proc filesystem with the --mount-proc option.
This will land us in a new shell with an isolated PID namespace, which means it will see only itself and its child processes. Example:
root@node0: # cgexec -g memory:memhog-demo unshare --pid --fork --mount-proc
root@node0:~#
root@node0:~# ps
PID TTY TIME CMD
1 pts/1 00:00:00 bash
13 pts/1 00:00:00 ps
root@node0:~#
As expected, we landed in a shell that thinks it is all alone 😏. If we look at the /proc filesystem, we find it is exclusive to this namespace. If you want to verify which cgroup the shell is in, look at this:
root@node0:~# cat /proc/1/cgroup
0::/memhog-demo
root@node0:~#
Where 1 is the process ID of the shell we are running.
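Another way to appreciate the isolation is to list the numeric PID entries in this /proc; only the shell itself should show up (what you see may vary slightly depending on what else is running in the namespace):
root@node0:~# ls -d /proc/[0-9]*
/proc/1
root@node0:~#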
Limiting number of forks
Recall that we set pids.max in our demo cgroup to a value of 5, so the cgroup cannot hold more than five processes in total (that count includes unshare and our shell). We should therefore be unable to keep forking new processes in this namespace.
This is what we get if we try to create more than 5 processes:
root@node0:~# ps
PID TTY TIME CMD
1 pts/1 00:00:00 bash
25 pts/1 00:00:00 ps
root@node0:~# sleep 100 &
[1] 26
root@node0:~# sleep 100 &
[2] 27
root@node0:~# sleep 100 &
[3] 28
root@node0:~# sleep 100 &
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
That's it, the cgroup restricted forking exactly as expected.
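From the host side you can also check pids.events for the group, which counts how many fork attempts hit the limit (the number shown here is just an example; it depends on how many retries bash made):
root@node0: # cat /sys/fs/cgroup/memhog-demo/pids.events
max 3
root@node0: #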
Clean up
Don't worry about deleting the namespace we created; it gets removed when you exit from the unshared shell.
But we need to delete the cgroup:
root@node0: # cgdelete memory:memhog-demo
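To confirm it is gone, the directory should no longer exist:
root@node0: # ls -d /sys/fs/cgroup/memhog-demo
ls: cannot access '/sys/fs/cgroup/memhog-demo': No such file or directory
root@node0: #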
Conclusion
It was fascinating to discover how containers work inside a single OS - all dancing together and completely oblivious of the existence of others 😄. There are numerous other cgroup and namespace features we have not covered, but this gives us a fair idea of how containers are contained.
It also gives a fair idea of how platforms like Kubernetes limit resource allocation for their pods, and how important these features become when designing cloud infrastructure with cost in mind.