Sending millions of packets per-second with AF_XDP
I'm Glenn Fiedler and welcome to Más Bandwidth, my new blog at the intersection of game network programming and scalable backend engineering.
XDP is an amazing way to receive millions of packets per-second, bypassing the Linux networking stack and letting you write eBPF programs that operate on packets as soon as they come off the NIC. But you can only send packets in response to packets received – you can't generate your own stream of packets.
In this article I'm going to show you how to use AF_XDP to generate and send millions of UDP packets per-second. You can use these packets to test components you've written in XDP/eBPF by sending packets close to line rate, making sure your XDP/eBPF programs work properly under load.
What is AF_XDP?
AF_XDP is a (relatively) new type of Linux socket that lets you send and receive raw packets directly from userspace programs. It's incredibly efficient because instead of going through traditional system calls like sendto and recvfrom for each packet, it sends and receives packets via lock-free ring buffers shared with the kernel.
On the receive path, AF_XDP works together with an XDP/eBPF program that decides which packets should be sent down to the AF_XDP socket, so you can make decisions like "is this packet valid?" before passing it down to userspace. But on the send side – which we'll focus on in this article – AF_XDP is completely independent of XDP/eBPF programs. It's really just an efficient way to send packets with ring buffers.
How do the ring buffers work?
First, you create an area of memory called a UMEM where all packets are stored. This memory is shared between your userspace program and the kernel, so they can both read and write to it.
The UMEM is broken up into frames, each big enough to hold a maximum-size packet: for example, 1500 bytes. So if you create a UMEM with 4096 frames, it's really just a contiguous array of 4096 buffers, each 1500 bytes large. It's nothing complicated.
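To make this concrete, here's roughly what creating a UMEM looks like with the xsk helper functions from libxdp (older code gets the same API from libbpf's xsk.h). This is a minimal sketch rather than the exact code from my repo: NUM_FRAMES is just an example, and since the kernel wants UMEM frame sizes to be a power of two of at least 2048 bytes, the default 4096 byte frames are used here instead of 1500. The two rings passed to xsk_umem__create are the fill ring (used only for receiving, so we ignore it) and the Complete ring described below.

#include <xdp/xsk.h>    // xsk helper API from libxdp (or <bpf/xsk.h> with older libbpf)
#include <stdlib.h>
#include <unistd.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE    // 4096 bytes

static void * umem_area;
static struct xsk_umem * umem;
static struct xsk_ring_prod fill;    // fill ring: receive path only, ignored here
static struct xsk_ring_cons comp;    // Complete ring: tells us which sends have finished

void create_umem()
{
    // allocate page-aligned memory that will be shared with the kernel
    if ( posix_memalign( &umem_area, getpagesize(), NUM_FRAMES * FRAME_SIZE ) != 0 )
        exit( 1 );

    // register it as a UMEM and create the fill + Complete rings
    if ( xsk_umem__create( &umem, umem_area, NUM_FRAMES * FRAME_SIZE, &fill, &comp, NULL ) != 0 )
        exit( 1 );

    // frame i starts at offset i * FRAME_SIZE within the UMEM
}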
Next, you create an AF_XDP socket linked to this UMEM and associate it with two ring buffers: TX and Complete. Yes, there are two additional ring buffers used for receiving packets but let's ignore them, because thinking of all four ring buffers at the same time fries brains.
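One wrinkle: with the xsk helpers, the Complete ring is actually created alongside the UMEM (as in the sketch above), and the TX ring is created when you create the socket. Continuing that sketch, binding an AF_XDP socket to a NIC queue looks something like this. The interface name and queue index are placeholders, and the XDP_USE_NEED_WAKEUP bind flag is optional:

static struct xsk_socket * xsk;
static struct xsk_ring_prod tx;    // TX ring: descriptors for packets we want sent

void create_socket()
{
    struct xsk_socket_config config = {
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .bind_flags = XDP_USE_NEED_WAKEUP,
    };

    // bind to queue 0 of "enp1s0f0" (placeholder interface name and queue index)
    // the RX ring is NULL because this socket only sends
    if ( xsk_socket__create( &xsk, "enp1s0f0", 0, umem, NULL, &tx, &config ) != 0 )
        exit( 1 );
}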
The TX ring buffer is the send queue. It just says, "hey kernel, here's a new packet to send, at this offset in the UMEM and it's this many bytes long". The kernel driver reads this TX data from the queue, and sends the packet in the UMEM at that offset and length directly to the network interface card (NIC).
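In code, continuing the sketch above, that's: reserve a slot in the TX ring, point the descriptor at the packet's offset and length within the UMEM, submit it, then kick the kernel if it asks for a wakeup. Here frame_offset and packet_bytes are assumed to be whatever you used when you wrote the raw packet into the UMEM frame:

// also needs <stdint.h> and <sys/socket.h>
void send_packet( uint64_t frame_offset, uint32_t packet_bytes )
{
    uint32_t tx_index;
    if ( xsk_ring_prod__reserve( &tx, 1, &tx_index ) != 1 )
        return;    // TX ring is full, try again later

    // describe the packet: offset into the UMEM + length in bytes
    struct xdp_desc * desc = xsk_ring_prod__tx_desc( &tx, tx_index );
    desc->addr = frame_offset;
    desc->len = packet_bytes;

    // hand it over to the kernel driver
    xsk_ring_prod__submit( &tx, 1 );

    // with XDP_USE_NEED_WAKEUP set, kick the driver only when it asks for it
    if ( xsk_ring_prod__needs_wakeup( &tx ) )
        sendto( xsk_socket__fd( xsk ), NULL, 0, MSG_DONTWAIT, NULL, 0 );
}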
Obviously, it's extremely important that you don't reuse a frame in the UMEM to send another packet until the packet in it has actually been sent, so the Complete ring buffer is a queue that feeds back the set of completed packet sends to the userspace program. You read from this queue and mark frames in the UMEM as available for new packets.
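Draining the Complete queue is the mirror image: peek to see how many sends have completed, read back the UMEM offset of each finished frame, hand those frames back to your own frame bookkeeping, then release the entries. In this sketch, mark_frame_free is a hypothetical helper standing in for whatever frame allocator you use:

void drain_complete_queue()
{
    uint32_t comp_index;
    uint32_t completed = xsk_ring_cons__peek( &comp, 64, &comp_index );

    for ( uint32_t i = 0; i < completed; i++ )
    {
        // UMEM offset of a frame whose packet has now been sent
        uint64_t frame_offset = *xsk_ring_cons__comp_addr( &comp, comp_index + i );

        // hypothetical helper: mark this frame as free so it can be reused
        mark_frame_free( frame_offset );
    }

    xsk_ring_cons__release( &comp, completed );
}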
And that's really all there is to it. You write raw packets in the UMEM, send them via the TX queue, and the kernel notifies you when packet sends are completed via the Complete queue. You can read more about the AF_XDP ring buffers here.
How do raw packets work?
The packet constructed in the UMEM is sent directly to the NIC without modification, so you have to construct a raw packet including ethernet, IPv4 and UDP headers in front of your UDP payload.
Thankfully, this is relatively easy to do. The Linux kernel headers provide convenient struct definitions for each of these, and you can use them to write the headers directly into memory.
Here's my code for writing a raw UDP packet:
// struct ethhdr, iphdr and udphdr come from <linux/if_ether.h>, <linux/ip.h> and <linux/udp.h>
// eth, ip and udp point at the ethernet, IPv4 and UDP headers inside the UMEM frame being built
// generate ethernet header
memcpy( eth->h_dest, SERVER_ETHERNET_ADDRESS, ETH_ALEN );
memcpy( eth->h_source, CLIENT_ETHERNET_ADDRESS, ETH_ALEN );
eth->h_proto = htons( ETH_P_IP );
// generate ip header
ip->ihl = 5;
ip->version = 4;
ip->tos = 0x0;
ip->id = 0;
ip->frag_off = htons( 0x4000 ); // do not fragment
ip->ttl = 64;
ip->tot_len = htons( sizeof(struct iphdr) + sizeof(struct udphdr) + payload_bytes );
ip->protocol = IPPROTO_UDP;
ip->saddr = htonl( 0xc0a80000 | ( counter & 0xFFFF ) ); // 192.168.*.* - lower 16 bits vary per packet
ip->daddr = SERVER_IPV4_ADDRESS;
ip->check = 0;
ip->check = ipv4_checksum( ip, sizeof( struct iphdr ) );
// generate udp header
udp->source = htons( CLIENT_PORT );
udp->dest = htons( SERVER_PORT );
udp->len = htons( sizeof(struct udphdr) + payload_bytes );
udp->check = 0; // UDP checksum is optional over IPv4
// generate udp payload
uint8_t * payload = (uint8_t*) udp + sizeof( struct udphdr );
for ( int i = 0; i < payload_bytes; i++ )
{
    payload[i] = i;
}
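The ipv4_checksum helper isn't shown above (the real one is in the full source linked below), but the IPv4 header checksum is just a one's complement sum of the header's 16-bit words, something like this:

uint16_t ipv4_checksum( const void * data, int bytes )
{
    // one's complement sum of the header, 16 bits at a time
    const uint16_t * word = (const uint16_t*) data;
    uint32_t sum = 0;
    while ( bytes > 1 )
    {
        sum += *word++;
        bytes -= 2;
    }
    if ( bytes == 1 )
    {
        sum += *(const uint8_t*) word;    // odd trailing byte (never happens for an IP header)
    }
    // fold the carries back into the low 16 bits, then invert
    while ( sum >> 16 )
    {
        sum = ( sum & 0xFFFF ) + ( sum >> 16 );
    }
    return (uint16_t) ~sum;
}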
One thing I do above: instead of using the actual LAN IP address of the client as the source address, I increment a counter with each packet sent and use it to fill the lower 16 bits of 192.168.[x].[y].
This distributes received packets across all receive queues for multi-queue NICs, because on Linux the queue is selected using a hash of the source and destination IP addresses. Without this, throughput will be limited by the receiver, because all packets arrive on the same NIC queue.
You can see the full source code for my AF_XDP test client and server here.
Test Setup
I'm running this test over 10G ethernet:
- Two old Linux boxes running Linux Mint
- Two Intel x540 T2 10G NICs ($160 USD each): https://www.newegg.com/intel-x540t2/p/N82E16833106083
- NetGear 10G switch ($450 USD): https://www.newegg.com/netgear-xs508m-100nas-7-x-10-gig-multi-gig-copper-ports-1-x-10g-1g-sfp-and-copper/p/N82E16833122954
Results
I can send 6 million 100-byte UDP packets per-second on a single core.
If I reduce to 64-byte UDP packets (on the wire, not payload – see this article), I should be able to send 14.88 million packets per-second at 10G line rate, but I can only get up to around 10-11 million. I suspect something NUMA-related is going on, with ksoftirqd running on different CPUs than the ones the packets were queued for send on, but I don't know how to solve it.
If you know what's up here, please email glenn@mas-bandwidth.com. I'd love to be able to hit line rate with AF_XDP and share it here.