Custom port-based BPF firewall loader for systemd services, 2nd part

Update at the end of the post

The first post introduced the IPIngressFilterPath and IPEgressFilterPath feature for the upcoming systemd release 243, which supports custom BPF programs as firewalls for services. It showed a simple example of a minimal BPF program written in C and how to load it for a systemd service, followed by a standalone program that dynamically controls a BPF filter to drop packets based on their size.

Now I will show a more realistic use case: a configurable firewall. The new code here can be found in the same repository as the first examples.

Some time ago, a port-based BPF firewall was planned to complement the inbuilt IP-address-based BPF firewall of systemd. It was written in BPF assembly, and mainly for that reason the PR was not merged. The new IPIngressFilterPath and IPEgressFilterPath features in v243 can be used to write a port-based filter as a custom BPF program, but it’s up to the user to do that.

This blogpost’s program allows you to filter IP packets by ports without writing any BPF code yourself. It is not only useful on its own but also answers the question of how the empty C template from the first post has to be filled out with more than a return 0; which drops all packets.

The road to a working BPF program is quite rough when you get started. There are many BPF examples online but they might use the wrong BPF program type. Here we use the cgroup/skb type, which works on skb socket buffer pointers; however, the skb data is not an Ethernet frame but an IP packet. Using the old style of helper functions to access packet bytes is also forbidden.

When you find examples of BPF code, it is good to first answer the basic question of where the helper functions come from. There is no unified BPF library and the examples can use different helper definitions (sometimes it’s a simple name change). Here I’ve taken them from the iproute2 repository and not from the kernel repository, but there are more variants to watch out for.

If you just want to check some ports or parse the HTTP header, you first need to parse the IP packet which can be a daunting task because even if you don’t need to handle Ethernet frames you still have to skip through the IPv4/v6 headers with varying length. The best is to reuse the kernel structs by casting the addresses of packet offsets to them. It’s not hard but unfamiliar if you haven’t dealt with kernel internals and IP headers before. Not to mention the tricks needed to pass the BPF verifier when trying to load the compiled bytecode into the kernel.
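To make the header skipping concrete, here is a rough user-space sketch of the IPv4 case. It is not part of the repository: the function name ipv4_ports and its return convention are made up for illustration, and it works on a plain byte buffer instead of an skb. It shows the two ideas that matter, namely deriving the header length from the IHL nibble and checking bounds before every read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Reads the TCP/UDP ports from a raw IPv4 packet in [data, data_end).
 * Returns 0 on success, -1 if the packet is too short or not TCP/UDP. */
int ipv4_ports(const uint8_t *data, const uint8_t *data_end,
               uint16_t *src_port, uint16_t *dst_port) {
  if (data_end - data < 20) { return -1; }       /* minimal IPv4 header */
  size_t ihlen = (size_t)(data[0] & 0x0f) * 4;   /* IHL counts 32-bit words */
  if (ihlen < 20) { return -1; }                 /* malformed header length */
  uint8_t proto = data[9];                       /* protocol field */
  if (proto != 6 && proto != 17) { return -1; }  /* 6 = TCP, 17 = UDP */
  /* Both TCP and UDP start with source port and destination port */
  if ((size_t)(data_end - data) < ihlen + 4) { return -1; }
  const uint8_t *l4 = data + ihlen;
  *src_port = (uint16_t)((l4[0] << 8) | l4[1]);  /* network byte order */
  *dst_port = (uint16_t)((l4[2] << 8) | l4[3]);
  return 0;
}
```

In the BPF program the same checks are written against skb->data and skb->data_end, and the kernel structs take the place of the manual byte offsets.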

The new BPF program parses the packet headers and drops or forwards the packets based on the metadata extracted from the packet headers. The forward/drop configuration is static for now, meaning that you need to reload the BPF program to change the configuration. The forward/drop filter expression can involve destination and source ports for TCP and UDP, the IP version (being IPv4 or IPv6) and the protocol types ICMP, TCP, or UDP.

The filter expression is a C expression over the boolean variables udp, tcp, icmp, ip, and ipv6 denoting the packet type and over the integers dst_port and src_port for the UDP/TCP ports. The filter is passed as a macro definition to the C BPF program. Examples of valid filters are FILTER='icmp || (udp && dst_port == 53) || (tcp && dst_port == 80)' or FILTER='!udp || dst_port == 53'.
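The macro mechanism can be tried out in plain user-space C before touching BPF. The following is only a sketch: the function verdict and the default value for FILTER are assumptions for illustration. It shows how the preprocessor pastes the expression into an if condition where the variables are in scope.

```c
#include <stdint.h>

/* Default taken from one of the example filters above;
 * override at compile time with -D "FILTER=..." */
#ifndef FILTER
# define FILTER icmp || (udp && dst_port == 53) || (tcp && dst_port == 80)
#endif

/* Hypothetical user-space stand-in for the final verdict:
 * returns 1 (forward) or 0 (drop). */
int verdict(uint8_t udp, uint8_t tcp, uint8_t icmp, uint8_t ip, uint8_t ipv6,
            uint16_t dst_port, uint16_t src_port) {
  /* Not every filter uses every variable, so silence unused warnings */
  (void)udp; (void)tcp; (void)icmp; (void)ip; (void)ipv6;
  (void)dst_port; (void)src_port;
  if (FILTER) {
    return 1;
  }
  return 0;
}
```

With the default filter, verdict(1, 0, 0, 1, 0, 53, 40000) forwards a UDP packet to port 53, while verdict(0, 1, 0, 1, 0, 443, 40000) drops a TCP packet to port 443.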

The following C BPF program can be compiled with this command (assuming the environment variable FILTER is set):

$ clang -O2 -Wall -target bpf -c port-firewall.c -o port-firewall.o -D "FILTER=${FILTER}"

You can find the source with a makefile to compile and load the BPF program here in the port-firewall directory.

/* Copyright 2019 Kai Lüke <kailueke@riseup.net>
 * SPDX-License-Identifier: GPL-2.0
 *
 * Minimal configurable packet filter, parses IP/IPv6 packets, ICMP, UDP ports,
 * and TCP ports. The forward rule is a C expression passed as FILTER variable
 * to the compiler with -D. The expression can use the boolean variables
 * [udp, tcp, icmp, ip, ipv6] and the integers [dst_port, src_port].
 * If the expression evaluates to 0 (false), the packet will be dropped.
 */
#include <linux/bpf.h>
#include "bpf_api.h"
#include <linux/in.h>
#include <linux/if.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/if_tunnel.h>
#include <linux/icmp.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

/* cgroup/skb BPF prog */
__section("filter")
int port_firewall(struct __sk_buff *skb) {
  __u8 udp = 0, tcp = 0, icmp = 0, ip = 0, ipv6 = 0;
  __u16 dst_port = 0;
  __u16 src_port = 0;

  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;

  ip = skb->protocol == htons(ETH_P_IP);
  ipv6 = skb->protocol == htons(ETH_P_IPV6);

  if (ip) {
    if (data + sizeof(struct iphdr) > data_end) { return 0; }
    struct iphdr *ip = data;
    /* IP fragmentation does not need to be handled here for cgroup skbs */
    icmp = ip->protocol == IPPROTO_ICMP;
    tcp = ip->protocol == IPPROTO_TCP;
    udp = ip->protocol == IPPROTO_UDP;
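    if (udp || tcp) {
      __u8 *ihlandversion = data;
      __u8 ihlen = (*ihlandversion & 0xf) * 4;
      /* The ports are the first 4 bytes of both the TCP and the UDP header.
       * Only require the 8-byte UDP header here, otherwise UDP datagrams
       * shorter than a TCP header would be dropped by the bounds check. */
      if (data + ihlen + sizeof(struct udphdr) > data_end) { return 0; }
      struct tcphdr *tcp = data + ihlen;
      src_port = ntohs(tcp->source);
      dst_port = ntohs(tcp->dest);
    }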
  } else if (ipv6) {
    struct ipv6hdr *ipv6 = data;
    /* __u16 because an extension header can be (255 + 1) * 8 = 2048 bytes long */
    __u16 ihlen = sizeof(struct ipv6hdr);
    if (((void *) ipv6) + ihlen > data_end) { return 0; }
    __u8 proto = ipv6->nexthdr;
    #pragma unroll
    for (int i = 0; i < 8; i++) { /* max 8 extension headers */
      icmp = proto == IPPROTO_ICMPV6;
      tcp = proto == IPPROTO_TCP;
      udp = proto == IPPROTO_UDP;
      if (udp || tcp) {
        /* Ports are the first 4 bytes of both headers, see the IPv4 case */
        if (((void *) ipv6) + ihlen + sizeof(struct udphdr) > data_end) { return 0; }
        struct tcphdr *tcp = ((void *) ipv6) + ihlen;
        src_port = ntohs(tcp->source);
        dst_port = ntohs(tcp->dest);
      }
      if (icmp || udp || tcp) {
        break;
      }
      if (proto == IPPROTO_FRAGMENT || proto == IPPROTO_HOPOPTS ||
          proto == IPPROTO_ROUTING || proto == IPPROTO_AH || proto == IPPROTO_DSTOPTS) {
        if (((void *) ipv6) + ihlen + 2 > data_end) { return 0; }
        __u8 hdrtype = proto; /* type of the extension header to skip */
        ipv6 = ((void *) ipv6) + ihlen;
        proto = *((__u8 *) ipv6); /* first byte: next header type */
        __u8 lenfield = *(((__u8 *) ipv6) + 1);
        if (hdrtype == IPPROTO_FRAGMENT) {
          ihlen = 8; /* fixed size */
        } else if (hdrtype == IPPROTO_AH) {
          ihlen = (lenfield + 2) * 4; /* AH counts 4-byte units minus 2 */
        } else {
          ihlen = (lenfield + 1) * 8; /* 8-byte units, first 8 bytes not counted */
        }
        if (((void *) ipv6) + ihlen > data_end) { return 0; }
      } else {
        break;
      }
    }
  }

  if (FILTER) {
    return 1; /* 1 = forward */
  }
  return 0; /* 0 = drop */
}

char __license[] __section("license") = "GPL";

The checks with > data_end are needed to ensure that only valid memory is accessed. The loop unroll is another thing done for the BPF verifier because support for loops is not available yet. I used the calculation ihlandversion & 0xf for the IPv4 header length because I didn’t test if using the kernel struct union byte works on both big and little endian systems (I guess it should work). If you have improvements please share them on the GitHub repository.
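The bounded loop over the IPv6 extension headers can also be exercised in user space. The sketch below is an illustration under assumptions, not code from the repository: ipv6_upper_layer is a made-up name, it operates on a byte buffer with an explicit length, and the length rules follow my reading of the IPv6 and AH specifications (most extension headers count 8-byte units without the first 8 bytes, AH counts 4-byte units minus 2, the fragment header is always 8 bytes).

```c
#include <stddef.h>
#include <stdint.h>

enum { HOPOPTS = 0, ROUTING = 43, FRAGMENT = 44, AH = 51, DSTOPTS = 60 };

/* Hypothetical helper for illustration only.
 * Walks over at most 8 extension headers of a raw IPv6 packet.
 * Returns the upper-layer protocol number and stores the offset of its
 * header in *l4_off, or returns -1 if the packet is truncated or the
 * loop bound is reached. */
int ipv6_upper_layer(const uint8_t *pkt, size_t len, size_t *l4_off) {
  if (len < 40) { return -1; }            /* fixed IPv6 header */
  uint8_t proto = pkt[6];                 /* next header field */
  size_t off = 40;                        /* invariant: off <= len */
  for (int i = 0; i < 8; i++) {           /* fixed bound, like the unrolled loop */
    if (proto != HOPOPTS && proto != ROUTING && proto != FRAGMENT &&
        proto != AH && proto != DSTOPTS) {
      *l4_off = off;
      return proto;
    }
    if (len - off < 2) { return -1; }     /* need next header and length bytes */
    uint8_t next = pkt[off];
    size_t hlen;
    if (proto == FRAGMENT) {
      hlen = 8;
    } else if (proto == AH) {
      hlen = ((size_t)pkt[off + 1] + 2) * 4;
    } else {
      hlen = ((size_t)pkt[off + 1] + 1) * 8;
    }
    if (len - off < hlen) { return -1; }  /* extension header is truncated */
    off += hlen;
    proto = next;
  }
  return -1;                              /* gave up after 8 extension headers */
}
```

Every read is preceded by a length check, which is the user-space counterpart of the comparisons against data_end.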

The makefile requires the filter to be passed when the program is built:

$ make FILTER='…'
$ # now we can load to /sys/fs/bpf/port-firewall:
$ make load

After compilation the makefile can load the program via sudo $(which bpftool) prog load port-firewall.o /sys/fs/bpf/port-firewall type cgroup/skb.

As mentioned, systemd v243 has the options IP(Ingress|Egress)FilterPath= for use in service files or systemd-run and systemd-nspawn. Look up how to use them in the first post or in the systemd documentation. It would be fun to use systemd’s service file template mechanism for the filter-loading units to instantiate on-demand compilation and loading of filters based on the forward/drop expression. If you write a loading service for a single filter expression, look at the first post and the bpf-program-only/ folder in the repository on how to do that.

For quick testing when systemd v243 is not available you can use systemd-run to spawn a shell in a new cgroup: either as system service with sudo systemd-run --scope -S or as user service with systemd-run --user --scope -S (where -S can be replaced with a concrete binary instead of starting a shell). This will print out the unit name which is also the name of the cgroup. The full cgroup path for the system service shell is /sys/fs/cgroup/unified/system.slice/NAME. For the user service shell the path is /sys/fs/cgroup/unified/user.slice/user-1000.slice/user@1000.service/NAME, where 1000 has to be replaced with your UID if it differs.

Then attach the BPF program to the cgroup: sudo $(which bpftool) cgroup attach /sys/fs/cgroup/unified/user.slice/user-1000.slice/user@1000.service/run-rfaa93ac79de2482d8ef1870fd6b508cd.scope egress pinned /sys/fs/bpf/port-firewall multi. You can choose either ingress or egress to filter incoming or outgoing packets. You can load the same filter for both ingress and egress and you can load multiple different filters per ingress/egress (also true when used through the systemd v243 option). As mentioned in the sidenote of the first post, if you turn on IPAccounting in systemd-run you need to turn on Delegate as well to allow multiple BPF programs.

For your convenience I’ve again included example systemd service files: bpf-make.service to configure and load the BPF program, and my-filtered-ping.service which uses the loading service. They differ a bit from the ones in the bpf-program-only folder from the first post because they use the makefile to first compile the program with a configuration.

Here is bpf-make.service which will be used as helper service for our ping service:

[Unit]
Description=BPF port-firewall load service

[Service]
Type=oneshot
RemainAfterExit=yes
# If bpftool is not installed system-wide use: Environment="PATH=/bin:/usr/bin:/path/to/bpftool-folder"
Environment='FILTER=icmp || (udp && dst_port == 53) || (tcp && dst_port == 80)'
ExecStart=/usr/bin/make -C /path/to/repo/bpf-cgroup-filter/port-firewall ; make -C /path/to/repo/bpf-cgroup-filter/port-firewall load
ExecStop=rm /sys/fs/bpf/port-firewall
LimitMEMLOCK=infinity

Here is my-filtered-ping.service that uses the loaded filter:

[Unit]
Description=my egress-filtered ping service
Requires=bpf-make.service
After=bpf-make.service

[Service]
ExecStart=ping 127.0.0.1
IPEgressFilterPath=/sys/fs/bpf/port-firewall
# If you don't have systemd v243 you can use this instead of the above line:
# ExecStartPre=/path/to/bpftool cgroup attach /sys/fs/cgroup/unified/system.slice/my-filtered-ping.service egress pinned /sys/fs/bpf/port-firewall multi

Either use the IPEgressFilterPath… line if you have systemd v243 or the ExecStartPre… line as workaround with bpftool. The ping is expected to work because we filtered egress but allowed ICMP packets, UDP on the DNS port, and TCP on port 80 (HTTP but not HTTPS).

That’s it for now. I hope the systemd v243 release will be out soon and people can use this small program here to add port filters to their services or as inspiration to write their own BPF filters. In my head I have ideas for a bandwidth throttling filter and a PCAP packet dumping filter.

UPDATE: Systemd unit templates

The above version of the loader unit file has to be copied for every filter. With a unit template file we can avoid this (it required an update to the makefile):

[Unit]
Description=BPF port-firewall load service template for filter: %I

[Service]
Type=oneshot
RemainAfterExit=yes
# If bpftool is not installed system-wide use: Environment="PATH=/bin:/usr/bin:/path/to/bpftool-folder"
Environment='FILTER=%I'
Environment='BPFNAME=%i'
ExecStart=/usr/bin/make -C /path/to/repo/bpf-cgroup-filter/port-firewall
ExecStop=/usr/bin/make -C /path/to/repo/bpf-cgroup-filter/port-firewall remove
LimitMEMLOCK=infinity

Template units behave like regular units but carry the parameter in the unit name, e.g., "bpf-firewall@icmp || (udp && dst_port == 53) || (tcp && dst_port == 80).service". The parameter ends up being encoded with systemd-escape and the makefile will use this encoded string as the program name in the BPF filesystem.
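To get a feeling for what the encoded parameter looks like, here is a simplified user-space sketch of the escaping rules as I understand them for the default mode of systemd-escape: "/" becomes "-", ASCII letters, digits, ":", "_" and "." are kept (except a leading "."), and every other byte becomes \xXX. The function unit_escape is hypothetical and the real tool remains the authority, especially for paths and other corner cases.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified re-implementation for illustration only.
 * Writes the escaped form of in to out.
 * Returns 0 on success, -1 if out is too small. */
int unit_escape(const char *in, char *out, size_t outlen) {
  static const char keep[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_.";
  size_t o = 0;
  for (size_t i = 0; in[i] != '\0'; i++) {
    unsigned char c = (unsigned char)in[i];
    if (o + 5 > outlen) { return -1; }  /* room for "\xXX" plus NUL */
    if (c == '/') {
      out[o++] = '-';
    } else if (strchr(keep, c) != NULL && !(c == '.' && i == 0)) {
      out[o++] = (char)c;
    } else {
      /* "-", "\", spaces, "|", "&", "=", "(" and ")" all end up here */
      o += (size_t)snprintf(out + o, 5, "\\x%02x", c);
    }
  }
  if (o + 1 > outlen) { return -1; }
  out[o] = '\0';
  return 0;
}
```

Escaping a name that contains "-" once yields \x2d, and escaping the result a second time turns the backslash into \x5c, which gives \x5cx2d. Always check the result against the real systemd-escape before relying on it.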

We could now use this in the Requires= and After= options and put the encoded filter string into IPIngressFilterPath/IPEgressFilterPath. But that means that every change to the filter needs three changes in the file.

Here is a hacky solution, making the final service a template unit itself so that the filter has to be specified as part of the service name:

[Unit]
Description=my egress-filtered ping service template
# To avoid specifying the FILTER here twice and below again,
# this service file is a template and the FILTER has to be
# passed as argument (referenced with %i) when instantiating
# the service via `systemctl start "service-with-filter@FILTER.service"`
Requires=bpf-firewall@%i.service
After=bpf-firewall@%i.service
# The alternative is to not use a template file and include the argument directly
# here as bpf-firewall@ESCAPED.service with ESCAPED being the output
# of `systemd-escape "FILTER"`.
# Then you can make this file here a regular service file without the @.

[Service]
ExecStart=ping 127.0.0.1
IPEgressFilterPath=/sys/fs/bpf/%i

# If you don't have systemd v243 you can use this instead of the above line:
# ExecStartPre=/path/to/bpftool cgroup attach /sys/fs/cgroup/unified/system.slice/system-service\x5cx2dwith\x5cx2dfilter.slice/%n egress pinned /sys/fs/bpf/%i multi
# Cannot use %p here but have to use 'service\x5cx2dwith\x5cx2dfilter' (encoded twice with systemd-escape) because the cgroup fs slice path name still contains the escaping, and if we escape it only once here, the escaping is reverted once and thus removed when the unit is loaded.

# Again, if this file is not a template, instead of %i use
# IPEgressFilterPath=/sys/fs/bpf/ESCAPED
# Without systemd v243 it would become
# ExecStartPre=/path/to/bpftool cgroup attach /sys/fs/cgroup/unified/system.slice/%n egress pinned /sys/fs/bpf/ESCAPEDTWICE multi
# with ESCAPEDTWICE being the output of `systemd-escape "ESCAPED"`.

With that file for our final service to start, the filter has to be specified as part of the service: systemctl start "service-with-filter@icmp || (udp && dst_port == 53) || (tcp && dst_port == 80).service". Not beautiful but it does its job. Ideally I would store the filter in a BPF map so that it can be set after BPF program loading, allowing it to be a simple ExecStartPre line in the service file.