3 min read

Writing highly scalable backends in UDP. The solution.

Last week I shared an interview question I've used successfully at Network Next. In this article I share the solution.
Writing highly scalable backends in UDP. The solution.

The actual question looks easy:

You are tasked with creating a client/server application in Golang that runs in Google Cloud. The client in this application must communicate with the server over UDP.

Each client sends 100 requests per-second. Each request is 100 bytes long. The server processes each request it receives and forwards it over HTTP to the backend.

The backend processes the request, and returns a response containing the FNV1a 64 bit hash of the request data. The server returns the response it receives from the backend down to the client over UDP.

Implement the client, server and backend in Golang. Provide an estimate of the cost to run the solution each month at a steady load of 1M clients, as well as some options you recommend as next steps to reduce the cost.

I think most programmers familiar with Golang could implement the code above over a weekend, it doesn't seem too hard. Time to get coding right?

No. Not at all. Turns out, it's basically fucking impossible.

The Kobayashi Maru

The Kobayashi Maru is a training exercise in the Star Trek franchise designed to test the character of Starfleet Academy cadets by placing them in a no-win scenario. The Kobayashi Maru test was first depicted in the 1982 film Star Trek II: The Wrath of Khan, and it has since been referred to and depicted in numerous other Star Trek media.

The nominal goal of the exercise is to rescue the civilian fuel ship Kobayashi Maru, which is damaged and stranded in neutral territory between the Federation and the Klingon Empire. The cadet being evaluated must decide whether to attempt to rescue the Kobayashi Maru—endangering their ship and crew—or leave the Kobayashi Maru to certain destruction. If the cadet chooses to attempt a rescue, an insurmountable enemy force attacks their vessel. By reprogramming the test itself, James T. Kirk became the only cadet to overcome the Kobayashi Maru simulation.

The phrase "Kobayashi Maru" has entered the popular lexicon as a reference to a no-win scenario. The term is also sometimes used to invoke Kirk's decision to "change the conditions of the test."

-- Wikipedia

What's important is how the programmer thinks

I've given this test to many candidates, and I've heard pretty much every response you could imagine, from "That's easy! I could code that in a few hours" to "That's impossible!". You can even see a bunch of people getting worked up over this question on the hacker news thread from when I posted it. https://news.ycombinator.com/item?id=39979078

From the answers that people give during interviews I can tell a lot about what sort of programmers they are. A particularly strong tell comes from programmers who think that they should just write it in <insert language/technology of choice> and everything would scale perfectly. It's easy if you just do X they say as hands are waved. Turns out if you want to actually solve this problem you really need to roll your sleeves up and code.

When you do this, you'll find it's such a difficult problem that even getting to 10k clients is hard. I can guarantee even an experienced programmer would work for days, potentially even weeks on a solution that hits 10k clients in Golang. It's also a very sneaky question in that and if you are not absolutely disciplined in tracking expected packets sent, vs. actual sent and received, you can easily fool yourself that you are scaling up when in fact most packets you send are dropped.

There are many tricks along the way to 10K clients: socket send and receive buffer sizes, SO_REUSEPORT, batching, sharding and being stateless, how to handle being CPU vs. IO bound, identifying bottlenecks, horizontal scaling, load balancing in Google Cloud, how the Linux kernel networking works and the many ways that UDP packets can get dropped. If you know all these things you are at an advantage – or perhaps, a disadvantage – since you'll get further along before you realize it's impossible.

Just like the Kobayashi Maru, you can learn a lot about somebody based on how they respond to the no-win scenario, and the very best programmers break the rules to pass the test.

So here is my answer. It's worth reviewing if you are curious about how to obtain high UDP request/response throughput for an application. It's not the only way to solve this problem, but it is mine:

https://github.com/mas-bandwidth/udp/blob/main/001/README.md

How would you approach it?