40 millisecond bug

Originally published on the author’s private blog.

This is a small story about tracking down a production bug in a back-end service written in the Rust language. It probably doesn’t provide much learning opportunities (unless you happen to hit the exact same bug), but might give a glimpse to a day of few computer engineers and supply a bit of entertainment.

A bit of backstory

We have an application called urlite. It serves as a backend to some other applications, provides them with a HTTP API. It’s in Rust because it is latency critical. Latencies of most requests are under a millisecond.

It uses the hyper and tokio libraries to handle HTTP. Some time ago, the libraries released a new version, to incorporate support for Rust’s new shiny async-await syntax. We were a bit late to the updating to the newer versions. Part of it was that it was working fine and we had only one short asynchronous function around, so there wasn’t real motivation. The other part was, urlite uses the spirit configuration layer (I guess the fact urlite uses it has something to do with me being author of spirit). The library takes a configuration file and makes sure all the other libraries for logging, HTTP, etc, are configured according to it.

And migrating spirit to new hyper and tokio was a big task, as both hyper and tokio changed quite a bit between the versions and because spirit does some things that are not usual for ordinary applications.

Eventually, it was decided keeping up to date with dependencies is important enough for me to invest some of the work time (as opposed to the usual free time I use to work on spirit) to do the update. It took about a week to migrate both spirit and urlite. It went through review, the automatic tests and we put it to the staging environment for a while, watching logs and graphs. Everything seemed fine, so after few days of everything looking fine, we pushed the button and put it to production.

The increased latencies

As it goes in these kinds of stories, by now you’re expecting to see what went wrong.

The thing is, our own metrics and graphs were fine. But the latency on the downstream service querying us increased by 40ms. The deployment got reverted, and we started to dig into where these latencies come from.

It was acting really weird

There were several very suspicious things about that.

Our own „internal“ latencies stayed the same. Our CPU usage also stayed the same.
The latency graph on the downstream side was flat 40ms constant.

Now, if we introduced some performance regression in the query handling, we would expect our CPU consumption to rise. We would also expect the latency graph to be a bit spiky, not completely flat 40ms constant. It almost looked like there was a 40ms sleep somewhere. But why would anyone put a 40ms sleep anywhere?

I’ve looked through documentation and didn’t see anything obvious. I’ve tried searching both our code and code of the dependencies for 40, asked on Slack if that 40ms value was familiar to anyone. Nothing.

The working theory we started with was that there could be some kind of back-off sleep on some kind of failure. Maybe hyper would be closing inactive connections in the new version, forcing the application to reconnect (the graph was for 99th percentile, so if we happened to close each 100th connection and reconnection took this long…) and maybe try IPv6 first and we would be listening on IPv4 only or… (in other words, we didn’t have a clear clue).

The benchmarks

My colleague started to investigate in a more thorough way than just throwing ideas around on Slack. He run a wrk benchmark against the service. On his machine, the latencies were fine. So he commandeered one of the stage nodes to play with it and run the benchmark there. And every request had 40ms latency on that machine. The previous version of urlite was fine, with under one millisecond.

Something was probably sleeping somewhere on the production servers, but not on the development machine. There probably was some difference in the OS settings, but definitely difference in the kernel version. The servers are running some well-tried Linux distribution, so they have a lot older version, while a developer is likely to run something much more on the edge.

Configuration options

The selling point of spirit is that most of the thing one can set through the APIs of various libraries now can be just put into a config file without any recompilation. If we did configuration by hand, we would expose only the things we thought would be useful, but with spirit, there’s everything. So, naturally, tweaking some of the config knobs was the next step (because it was easy to do).

It was discovered that turning the http1-writev option off, which corresponds to this method in hyper, made the latencies go away.

We now had a solution, but I wasn’t happy about not understanding why that helped, so I went to dig the rabbit hole and find the root cause. Turns out there were several little things in just the constellation to make the problem manifest.

Overriding the defaults of `http1_writev`

The method controls the strategy in which way data are pushed into the socket. With vectored writes enabled, it sends two separate buffers (one with headers, the other with the body if it’s small enough) through a single writev syscall. If they are disabled, hyper copies all the bytes into a single buffer and sends that as a whole.

It turns out that the configuration method takes a bool, but there are actually 3 states. The third (auto) is signaled by not calling the method and spirit-hyper was calling it always, with the default to turn the writev on. I don’t know if leaving it on auto would make the bug go away, but I’ve fixed the problem in the library anyway.

Splitting vectored writes

Spirit wants to support a bit of configuration on top of what the underlying libraries provide on their own. One of such things is limiting the number of concurrently accepted connections on a single listening socket. Users don’t have to take advantage of that (the types for the configuration can be composed together to either contain that bit or not and the administrator may leave the values for the limits unset in the configuration and then they won’t be enforced).

Anyway, in case the support for the limits is opted in through using the type with the configuration fields, the connections themselves are wrapped in a Tracked type. The type tries to be mostly transparent for use and can be used inside hyper (which is what the default configuration type alias in spirit-hyper does), but tracks how many connections there are, to not accept more if it runs out of the limit.

When implementing the bunch of traits for the wrapper, I’ve overlooked the AsyncWrite::poll_write_buff. It is a provided method, which means it has a default implementation. It is the one that abstracts the OS-level writev, it can write multiple buffers (the Buf represents a segmented buffer).

The default implementation simply delegates to multiple calls to the ordinary write. Therefore, this omission combined with the default of enabling vectored writes results in calling into the kernel twice, each time with a small buffer, instead of once with two small buffers or once with a big buffer.

That was definitely something to get fixed, because if nothing else, syscalls are expensive and calling more of them is not great. But I’ve finally felt like I’m on the right path, because:

Nagle’s algorithm

You probably know that TCP stream is built from packets going there and back. Optimizing how to split the stream into the packets and when to send them is a hard problem and the research in that area is still ongoing, because there are many conflicting requirements. One wants to deliver the data with low latency, utilize the whole bandwidth, but not overflow the capacity of the link (in which case the latencies would go up or packets would get lost and would have to be retransmitted), leave some bandwidth to other connections, etc.

The Nagle’s algorithm is one of the older tricks up in TCP’s sleeve. The network doesn’t like small packets. It is better to send as large packets as the link allows, because each packet has certain overhead ‒ the headers take some space, but also routers spend computation power mostly based on number of packets and less on their size. If you start sending a lot of tiny packets, the performance will suffer. While the links are limited by the number of bytes that can pass through them, routers are more limited by the number of packets. If a router declares to be able to handle a gigabit connection, it actually means a gigabit if you use full-sized packets, but will get to few megabits if you split the data into tiny packets (at least for the cheap routers, there are some that are capable of handling even a gigabit of small ones).

So it would be better to wait until the send buffer contains a packetfull of data before sending anything. But we can’t do that, because we don’t know there’ll ever be a full packet of data, or generating more data might wait for the other side to answer. We would never send anything and wait forever and the Internet would not work.

Instead, the algorithm is willing to send one undersized packet and then it waits for an ACK from the other side until sending another undersized one. If it gets a packetfull of data to send in the meantime, that’s great and it sends it (it won’t get better by waiting longer), but it won’t ever have two undersized ones somewhere in flight, therefore won’t kill the network’s performance by them.

This works, it’ll make progress eventually because once the ACK comes and all the data that accumulated until then is sent.

But it also slows things down. Like in our case. What happens if we do it using single syscall, the whole HTTP response forms a single undersized packet (we have really small answers) and gets sent, no matter if it’s submitted to the kernel by one big or two small buffers.

On the other hand, if we split the submission into two syscalls, this is what happens:

We write the first part (headers). The kernel sends them out as an undersized packet.
We write the second part (the body). But the kernel shelves them into the send buffer and waits for sending them until it sees the ACK, because there’s one undersized packet somewhere in flight already. Here is our sleep.

This theory got confirmed by toggling the tcp-nodelay option on (and leaving writev on too, without the fix in the library applied), which also solved our problem. This option turns the Nagle’s algorithm off and just sends all data out as soon as possible (considering other things like buffers in the network card, TCP windows, congestion algorithms…).

Ok, but why 40ms?

But wait, the other service lives on the same machine. It really can’t take that long for the packet to pass through the loopback interface and an ACK to return back. Hell, it should not take 40ms if it was a different datacenter on the same continent. Not speaking about this taking exactly 40ms. Even if we had really long and slow loopback, it would not turn out to be 40ms exactly each time.

So there must be something more happening.

And indeed, there is. It’s called delayed ACKs and it’s also an optimisation of TCP. One that got introduced about the same time as Nagle’s algorithm, but without knowing about each other or considering how they would interact.

You see, one doesn’t want to send small packets (deja-vu?). A packet carrying only an ACK and no data is certainly small and we don’t want to send it if we can help that.

As an ACK can be used to acknowledge multiple received packets at once and it can piggy-back on a packet with actual data in it, the receiver wants to wait a little bit. The idea is, when data is received, it often provokes the application to respond to it with some other data so we wait for that to happen, then send the data with the ACK. If not, we at last wait for a second packet, to acknowledge two at once. Only if we don’t get any response data nor more incoming packets we decide to send that ACK eventually. After all, the sender doesn’t really need that ACK fast, does it, because if it would have more data to send, it would send at least the whole TCP window, which is several packets long.

Except that the other side might be waiting for it to send more data because of the Nagle’s algorithm. That sucks.

And indeed, a little bit of searching through the Internet actually finds a kernel patch talking about delayed ACKs and the value of 40ms. Bingo, this can’t be coincidence.

I have reasons to believe this interaction got somehow solved in newer kernels (I don’t know how, but apparently, it no longer happens on our development machines, at least with loopback). And that 40ms timeout is something the client side chooses.

Conclusion

Well, sometimes the situation happens to be a really complex mess that results in some weird interaction (and make a good material for a blog post). The above combination of the TCP optimisations is described on the Internet, but I’d say it’s not widely known one. That getting triggered through defaulting to a less optimal but generally good enough implementation is even bigger chance. And that such a small detail can make your optimized service 40 times slower.

Nevertheless, our takeaway, apart from fixing the discovered issues is:

We started to watch not only our own graphs, but graphs of other services around and have alerts for them. Our own graphs just can’t contain delays introduced by packets sleeping somewhere in the kernel.
We now run with tcp-nodelay enabled. Even when the bug is now fixed, the interaction is still technically possible to hit in case the client uses pipelining ‒ the client could send two HTTP requests in a row without waiting for the first response in between, possibly in the same TCP packet. Then we would have written the first response and hit the problem after writing the second one.

Rust is generally known to catch a lot of bugs during compile time. But no matter how good your language, tooling and tests are, some complex combination of chances still can cause bugs nobody foresees. Then it’s up to how well the rest of the system deals with it, how easy it is to roll back to previous versions and, as always, some luck and intuition while debugging (toggling the writev option was only a hunch at best, for example).

A bit of backstory

The increased latencies

It was acting really weird

The benchmarks

Configuration options

Overriding the defaults of `http1_writev`

Splitting vectored writes

Nagle’s algorithm

Ok, but why 40ms?

Conclusion

Menu

Recent Posts

Archives

Categories

40 millisecond bug

A bit of backstory

The increased latencies

It was acting really weird

The benchmarks

Configuration options

Overriding the defaults of http1_writev

Splitting vectored writes

Nagle’s algorithm

Ok, but why 40ms?

Conclusion

Menu

Recent Posts

Archives

Categories

Overriding the defaults of `http1_writev`