Custom BPF firewalls for systemd services

The upcoming systemd release 243 will allow to specify custom BPF programs as firewall for systemd services. I implemented this as coding task from my interview at Kinvolk.

BPF programs are small bytecode programs the Linux kernel can run in its network stack or attached to syscalls and other events. BPF is sometimes called eBPF because it extends the original Berkley Packet Filter (BPF) used to pre-filter packets for RAW sockets (this classic BPF is now called cBPF). Read more about BPF here.

The usage of BPF in this post here is limited to programs attached as filters for IP packets of all sockets in a cgroup (control group). Cgroups are actually hierarchical which also means that filters of the parent cgroup will also apply. In addition, multiple BPF programs per cgroup may be attached. The filtering happens after IP reassembly/before fragmentation.

Systemd puts the service (unit) process in a new cgroup and any children processes will also stay there. In 2017 systemd got the ability to have per-unit IP accounting and filtering based on IP addresses which is done with exactly those cgroup socket filters. There was a plan to add port filtering as well but the way systemd ships BPF code meant that it had to be done in BPF assembly.

The task I took up was planned here and proposed to offload the BPF program creation to the user instead of having it done by systemd. This allows to have port filters or even HTTP-aware filters supplied by an external program by pinning the BPF programs to /sys/fs/bpf/….

My commit adds two new properties for systemd units. IPIngressFilterPath and IPEgressFilterPath both take a path to a pinned BPF program. While the properties take a single path, you can specify them multiple times to attach more than one filter. They are combine as queue. An empty assignment to the property resets all previous filters. Of course we don’t have to drop any packets and can also use this machinery to monitor or modify packets.

The filters can be specified in service unit files, or with systemd-run or systemd-nspawn. I will start with giving an example that uses systemd-run, first with a simple filter compiled from C, then with an interactive MTU filter. You can find the examples in this repository as well.

Simple dropping filter example

The simplest way of writing a BPF program is using C and compiling with clang to BPF bytecode (but there are some limitations and workarounds to get to know). Here is a source file with a filter that just drops any packet:

/* cgroup/skb BPF prog */
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

__section("filter")
int cgroup_socket_drop(struct __sk_buff *skb)
{
    /* analyze skb content here */
    return 0; /* 0 = drop, 1 = forward */
}

char __license[] __section("license") = "GPL";

This can be compiled to an object file:

$ clang -O2 -Wall -target bpf -c cgroup-sock-drop.c -o cgroup-sock-drop.o

To load and pin it to the BPF filesystem we will use the bpftool binary. Install bpftool as package in Fedora (sudo dnf install bpftool) or compile and copy it to your PATH from a kernel source directory (cd ~/linux-source-x.xx/tools/bpf/bpftool ; make bpftool ; cp bpftool ~/.local/bin/).

Then we can load our small filter:

$ sudo `which bpftool` prog load cgroup-sock-drop.o /sys/fs/bpf/cgroup-sock-drop type cgroup/skb

Unloading is simply done with rm /sys/fs/bpf/cgroup-sock-drop and is needed before an updated version can be pinned to that location.

Using a filter with systemd-run/nspawn

Given you read this when the future systemd 243 release is installed, we can now use our BPF program as ingress filter for a temporary ping service:

$ sudo systemd-run -p IPIngressFilterPath=/sys/fs/bpf/cgroup-sock-drop --scope ping 127.0.0.1
Running scope as unit: run-re62ba1c….scope
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
^C # cancel since it will not get responses
--- 127.0.0.1 ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 186ms

We could specify more than one ingress filter or egress filter but since the first one already drops everything there would not be any difference. The syntax for systemd-nspawn is similar and allows to use our BPF filter as firewall for a whole container: systemd-nspawn --property=IPIngressFilterPath=/sys/fs/bpf/cgroup-sock-drop ….

Using service files

Since service files will probably be run at boot time, we need a service in /etc/systemd/system/cgroup-sock-drop-filter.service that loads and pins our filter (change the /path/to parts):

[Unit]
Description=cgroup socket drop filter

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/path/to/bpftool prog load /path/to/cgroup-sock-drop.o /sys/fs/bpf/cgroup-sock-drop-filter type cgroup/skb
ExecStop=rm /sys/fs/bpf/cgroup-sock-drop-filter
LimitMEMLOCK=infinity

Note the last line without the loading will fail. We can use this filter now in multiple services, both as ingress or egress filter. Here comes our ping example again as /etc/systemd/system/my-ping.service:

[Unit]
Description=my ping service
Requires=cgroup-sock-drop-filter.service
After=cgroup-sock-drop-filter.service

[Service]
ExecStart=ping 127.0.0.1
IPIngressFilterPath=/sys/fs/bpf/cgroup-sock-drop-filter

Of course you need some more other lines to run it at startup as usual (e.g., After=network.target and an [Install] section with WantedBy=multi-user.target). Anyway we now run systemctl start my-ping.service and see its output when we run systemctl status my-ping.service. The output should be the same as previously.

Interactive MTU filter example

BPF programs and userspace programs can communicate through BPF maps. This makes it possible to write a BPF filter that drops packets, e.g., larger than a certain size stored in a BPF map. Then a userspace program can change this value dynamically. In the repository linked above you can find a small load_and_control_filter tool doing that in the standalone folder. It simulates the behavior of a small MTU (Maximum Transfer Unit) on a network path or packet drops when the MTU is changed to 0.

The tool loads a BPF cgroup ingress/egress filter bytecode that filters based on the packet size. It then pins the BPF filter to a given location in /sys/fs/bpf/. Through the +/- keys the MTU can be changed interactively (actually changes the values in the BPF map). Optionally the initial MTU value can be specified on startup. The program can also attach the BPF filter to a cgroup by specifying the cgroup by its path. The BPF filter stays loaded when the program exits and has to be deleted manually with rm.

In contrast to the first example, it does not use a BPF compiler but uses hardcoded BPF assembly instructions to include the BPF code in the final program. This is not very accessible for hacking on it but for me it was interesting to see how BPF instructions work and what needs to be done to comply with the verifier.

In one terminal we can run the interactive filter:

$ sudo ./load_and_control_filter -m 100 -t ingress /sys/fs/bpf/ingressfilter
cgroup dropped 0 packets, forwarded 0 packets, MTU is 100 bytes (Press +/- to change)
… # keeps running

This loads the BPF filter to /sys/fs/bpf/ingressfilter which we can use for a systemd service the same way as the first dropping filter.

In a second terminal we will run ping again as root in a temporary systemd scope and specify our filter as IPIngressFilterPath:

$ sudo systemd-run -p IPIngressFilterPath=/sys/fs/bpf/ingressfilter --scope ping 127.0.0.1
Running scope as unit: run-….scope
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.069 ms
…

When we switch back to the first terminal and press -, the new MTU is 50 bytes and we can see the dropped packet count increase. In the ping terminal we will see no new responses because they are all dropped.

Attaching a filter to a cgroup without systemd 243

The load_and_control_filter program can be told to attach the filter to a cgroup of a systemd service. This just needs the path of the cgroup which is derived by the unit name.

Sidenote: Systemd uses a BPF filter for its IP accounting and firewalling based on IP addresses. If such a filter is present but no others, the flag to allow multiple BPF filters for a cgroup is missing. As workaround when, e.g., IP accounting is enabled, we can tell systemd that the cgroup management is done by externally. This means that systemd will use the flag to allow multiple BPF filters instead of loading the IP accounting BPF filter without this flag.

Using systemd-run again for a temporary service/scope:

$ sudo systemd-run -p IPAccounting=yes -p Delegate=yes --scope ping 127.0.0.1
Running scope as unit: run-r9f31b3947f4c4a11a24babf5517fe025.scope
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.069 ms
…

We can see the scope name in the first output line. This is also the last part of the cgroup path we have to use as argument in order to attach the filter to the cgroup.

$ sudo ./load_and_control_filter -m 100 -c /sys/fs/cgroup/unified/system.slice/run-r9f31b3947f4c4a11a24babf5517fe025.scope -t ingress /sys/fs/bpf/myfilter
cgroup dropped 0 packets, forwarded 4 packets, MTU is 100 bytes (Press +/- to change)
… # keeps running and increases the forward count

Now we can hit - to reduce the MTU and observe the packet drop count increasing while no ping responses can be seen. We could also start an additional egress filter to observe what happens to ping when outgoing packets are dropped. This is actually propagated through the sendmsg syscall returning Operation not permitted.

I hope that someone will use the new systemd properties for more interesting filters than the both above. The interactive MTU filter may be useful to try out if PLPMTU kicks in for a TCP connection with echo 2 > /proc/sys/net/ipv4/tcp_mtu_probing and appropriate values in …/tcp_base_mss and …/tcp_probe_*.

I also hope that this makes BPF filters more accessible for users. I imagine that a filter for, e.g., HTTP paths would be loaded and combined with another one for ports, both provided by systemd services in a reusable fashion (templates with @?). Maybe a filter has enough generality or contextual information to be useful for multiple services so that they all share a single instance. Maybe it also turns out to be useful for user services (test that with systemd-run --user …). You can reach out to me via mail or GitHub issues in the repository for the above examples.

The feature is actually quite simple and I don’t blog often but a feature that is not explained well may stay unused. I’m thankful for the idea to work on this, provided to me in my interview at Kinvolk. My experience in GNOME and the research at KAIST’s Advanced Networking Lab with Prof. Sue Moon have probably helped me to get this done. The last word of thanks goes to Lennart Poettering for the review and guidance to get the PR merged.